[vioscsi] Fix SendSRB regression and refactor for optimum performance [viostor] Backport SendSRB improvements #1135
…vers-windows into SendSRB-fix
Thanks a lot. Would it be possible to split this PR into two separate ones, WPP related and SRB related? Can you please share the qemu command line and the performance numbers? If possible, it would be nice to compare IOs/throughput numbers in relation to the number of queues, queue depth, IO size, CPU usage and IO latency for the old and new code. (However, if there is any problem with that I will try to test it myself.) Best,
Happy to help Vadim. Can do the split. Gives me a chance to fix ProcessQueue() too (https://github.com/virtio-win/kvm-guest-drivers-windows/pull/1135/files#diff-b8f1732536a5020c949209736b628a0af83022fbbded42c0eaef7bc3149dd6c3L1475-R1493)... I can probably run up some comparisons again too. I've used several different qemu command lines for various scenarios including with It's likely to be all using threads with a HBA per disk. That ok? I can share my stat collector script if that helps. Cheers,
A few passing comments. First, fantastic work, I jumped out of bed seeing this this morning, very exciting stuff. Some housekeeping, small stuff:
Past that, RE sharing scripts/cmdlines/data - yes please! The more the merrier, and it will only help the overall community here get better and stronger over time. RE performance - very exciting stuff, this gets me going. When talking about it, it's great to talk about test X, with IO pattern ABC with both before and after. Maybe even throw it in a simple table in markdown in the PR? Bonus points for exact incantations (like diskspd.exe -blah -here -stuff) so that others can reproduce on other configurations
Thanks again, Ben - did a first pass
if (!Srb) {
Just curious, do we ever hit this scenario where we are coming into SendSRB with a null Srb? Meaning, are there any callers that actually can do this?
Seemed like a sanity check to me. It lives on, so maybe someone saw one in the past. We can add a WPP macro here to check and use a reserved flag and just look for that in the trace. It would take pretty extreme performance testing, and more to the point - fairly thorough fail condition testing. Maybe raise a new issue?
Right, it is a kind of sanity check. Static driver verifier was failing without this check.
}

EXIT_FN_SRB();
return;
do we need this extra indent?
No, nor the indent for the brace at L75.
if (!Srb) {
EXIT_FN_SRB();
EXIT_FN_SRB() is an alias for debugging, right? If so, can we align/indent appropriately here? It makes it easier to read.
The question here I think is: does it stay or does it go...? Another call slows the driver down.
Maybe this is technically a macro, so normally they wouldn't be indented methinks.
@vrozenfe can probably answer this.
It's probably worth mentioning that we can probably add some conditional blocks like #ifndef TRACE_VQLOCKS or #ifndef TRACE_SRBS to wrap these. Then indenting might be less of an issue.
The other option here, if everyone is ok with using it, would be to use goto label_at_end_of_function; and then drop label_at_end_of_function: just before the WPP macro call. I'm fine with that. It's really what goto in C is supposed to be used for and cannot be used to jump out of the function like in other langs. I see it in [virtio] in some places. Many projects frown on using goto though, so...
... goto is also faster - which I'm guessing we all like...
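For anyone unfamiliar with the single-exit goto pattern being discussed, here is a minimal standalone sketch. It is an illustration only: DEMO_ENTER_FN(), DEMO_EXIT_FN() and demo_send_srb() are hypothetical stand-ins, not the driver's WPP-backed ENTER_FN_SRB()/EXIT_FN_SRB() macros or SendSRB().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's trace macros; the real
 * ENTER_FN_SRB()/EXIT_FN_SRB() are WPP-backed and look different. */
static int g_enter_count = 0;
static int g_exit_count = 0;
#define DEMO_ENTER_FN() (g_enter_count++)
#define DEMO_EXIT_FN()  (g_exit_count++)

/* Returns 0 on success, -1 if srb is NULL. Every path leaves through
 * the single label, so the exit trace fires exactly once per call. */
static int demo_send_srb(const void *srb)
{
    int status = 0;

    DEMO_ENTER_FN();

    if (srb == NULL) {
        status = -1;
        goto out;               /* early exit still reaches the exit trace */
    }

    /* ... the request would be queued here ... */

out:
    DEMO_EXIT_FN();
    assert(g_exit_count == g_enter_count);  /* entry and exit traces stay paired */
    return status;
}
```

With this shape, an early return cannot skip the exit trace, which is the missing-exit symptom described elsewhere in this review.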
Alternatively, we can just drop an appropriately indented RhelDbgPrint(); and leave it at that.
If we use say RhelDbgPrint(TRACE_LEVEL_RESERVED8, " After KickPrep Notify State : %d \n", notify); that will also work with WPP provided TRACE_LEVEL_RESERVED8 is defined in WPP_DEFINE_CONTROL_GUID(). You can then just toggle that flag to enable or disable the trace.
Any of these work for me. What I think is most important is that there is working tracing...
RhelDbgPrint(TRACE_LEVEL_FATAL, " Could not put an SRB into a VQ, so complete it with SRB_STATUS_BUSY. QueueNumber = %d, SRB = 0x%p, Lun = %d, TimeOut = %d.\n", QueueNumber, srbExt->Srb, SRB_LUN(Srb), Srb->TimeOutValue);
RhelDbgPrint(TRACE_LEVEL_WARNING, " Could not put an SRB into a VQ, so complete it with SRB_STATUS_BUSY. QueueNumber = %d, SRB = 0x%p, Lun = %d, TimeOut = %d.\n", QueueNumber, srbExt->Srb, SRB_LUN(Srb), Srb->TimeOutValue);
}
VioScsiLockManager(DeviceExtension, MessageID, &LockHandle, FALSE, VIOSCSI_VQLOCKOP_UNLOCK);
So is the bug here that we're simply unlocking the queue too soon, such that when there is the error path here, things can get jumbled up and blow up?
I believe so, but probably something underlying in the virtqueue implementation too.
I think this change doesn't just touch the error path but also the happy path (res >= 0):
- Without this change (on 54442f0), a possible sequence of events for the happy path is
  VioScsiVQLock -> virtqueue_add_buf -> VioScsiVQUnlock -> virtqueue_kick_prepare -> virtqueue_notify
  so virtqueue_kick_prepare and virtqueue_notify happen outside the lock.
- With this change, the sequence of events is
  VioScsiVQLock -> virtqueue_add_buf -> virtqueue_kick_prepare -> VioScsiVQUnlock -> virtqueue_notify
  So with this change, virtqueue_kick_prepare happens inside the lock, virtqueue_notify happens outside the lock.
I'm no VirtIO expert so I can't really say whether virtqueue_kick_prepare is supposed to happen inside or outside the lock. But FWIW, to me it looks like the Linux virtio scsi driver also kicks inside the lock and notifies outside the lock:
Also, it looks like until and including 7dc052d, the virtqueue_kick_prepare used to happen inside the lock:
kvm-guest-drivers-windows/vioscsi/helper.c
Line 112 in 7dc052d
notify = virtqueue_kick_prepare(adaptExt->vq[QueueNumber]) ? TRUE : notify;
and happens outside the lock starting with f1338bb (this commit is also mentioned by @benyamin-codez)
kvm-guest-drivers-windows/vioscsi/helper.c
Line 109 in f1338bb
if (virtqueue_kick_prepare(adaptExt->vq[QueueNumber])) {
Could the early unlock be the cause for the vioscsi "Reset to device \Device\RaidPortN" + IO lockups?
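The ordering with the change applied (add and kick-prepare inside the lock, notify outside) can be modelled with a toy queue. This is a sketch under loud assumptions: struct demo_vq and every demo_* function are hypothetical and only record where each step ran; none of it is the driver's or the VirtIO library's code, and the "lock" is a plain flag rather than a spinlock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these types or functions are the driver's. */
struct demo_vq {
    bool locked;              /* stands in for the per-queue spinlock    */
    int  pending;             /* buffers added since the last kick prep  */
    bool prep_under_lock;     /* recorded so the ordering can be checked */
    bool notify_under_lock;
    int  notifications;
};

static void demo_lock(struct demo_vq *vq)   { assert(!vq->locked); vq->locked = true; }
static void demo_unlock(struct demo_vq *vq) { assert(vq->locked);  vq->locked = false; }

static int demo_add_buf(struct demo_vq *vq)
{
    assert(vq->locked);       /* ring updates need the lock */
    vq->pending++;
    return 0;
}

/* Publishes pending buffers and reports whether a notification is wanted. */
static bool demo_kick_prepare(struct demo_vq *vq)
{
    bool need = vq->pending > 0;
    vq->prep_under_lock = vq->locked;
    vq->pending = 0;
    return need;
}

static void demo_notify(struct demo_vq *vq)
{
    vq->notify_under_lock = vq->locked;
    vq->notifications++;
}

/* Ordering with the change applied: prepare inside the lock, notify outside. */
static int demo_send(struct demo_vq *vq)
{
    bool notify = false;
    int res;

    demo_lock(vq);
    res = demo_add_buf(vq);
    if (res >= 0) {
        notify = demo_kick_prepare(vq);
    }
    demo_unlock(vq);

    if (notify) {
        demo_notify(vq);
    }
    return res;
}
```

The point of the model is only that the ring state read by the kick-prepare step cannot change between the add and the prepare, while the potentially slow notification does not extend the time the lock is held.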
Could the early unlock be the cause for the vioscsi "Reset to device \Device\RaidPortN" + IO lockups?
Yes, under high load conditions, this could certainly be a trigger.
It's probably worth pointing out that virtqueue_kick_prepare() and virtqueue_kick() are two different things. An actual kick includes the virtqueue_notify(), which executes the notification callback on the VQ.
Here is the latter from VirtIOPCICommon.c:
void virtqueue_notify(struct virtqueue *vq)
{
    vq->notification_cb(vq);
}
void virtqueue_kick(struct virtqueue *vq)
{
    if (virtqueue_kick_prepare(vq)) {
        virtqueue_notify(vq);
    }
}
However, virtqueue_kick_prepare() - which in our case calls virtqueue_kick_prepare_split() - prepares the buffer and returns TRUE if the device needs to be notified. Here is the return comparator for reference:
if (_vq->vdev->event_suppression_enabled) {
    return wrap_around || (bool)vring_need_event(vring_avail_event(&vq->vring), new, old);
} else {
    return !(vq->vring.used->flags & VIRTQ_USED_F_NO_NOTIFY);
}
...and:
/* The Host uses this in used->flags to advise the Guest: don't kick me when
* you add a buffer. It's unreliable, so it's simply an optimization. Guest
* will still kick if it's out of buffers. */
#define VIRTQ_USED_F_NO_NOTIFY 1
It also bears mentioning that even when a notification callback is required, it is completely up to the device as to what it will do. It may well just ignore the kicks...
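For reference, the event-suppression test named vring_need_event() in the comparator above is, as published in the VirtIO specification and Linux's virtio_ring.h, a wrap-safe 16-bit comparison. The sketch below reproduces that arithmetic standalone; demo_need_event is a hypothetical name used here so it cannot be mistaken for the driver's own copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe check: has the avail index moved past event_idx while going
 * from old_idx to new_idx? All arithmetic is modulo 2^16, so the test
 * keeps working when the ring index wraps around. */
static bool demo_need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
    assert((uint16_t)(new_idx - old_idx) <= 0x8000u);  /* sane batch size for the demo */
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}
```

For example, moving the index from 10 to 12 crosses an event index of 10, so a notification is needed, whereas an event index of 12 has not been crossed yet.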
RhelDbgPrint(TRACE_LEVEL_WARNING, " Could not put an SRB into a VQ, so complete it with SRB_STATUS_BUSY. QueueNumber = %d, SRB = 0x%p, Lun = %d, TimeOut = %d.\n", QueueNumber, srbExt->Srb, SRB_LUN(Srb), Srb->TimeOutValue);
}
VioScsiLockManager(DeviceExtension, MessageID, &LockHandle, FALSE, VIOSCSI_VQLOCKOP_UNLOCK);
if (notify){
Why would we not want to kick the underlying backend in the failure case?
In the QUEUE_FULL branch above, notify will be false, yea? In which case, we won't send a kick. I'd imagine we would want to kick unilaterally, just in case it needs to be hit with a hammer?
Yeah, I think there was a notify = TRUE; at the foot of the QUEUE_FULL branch previously. I didn't notice any change with or without it in testing. I can have another go around though, but it could be in virtqueue implementation (which I'm presently looking at).
So it appears that res will only ever (famous last words) be a 0 (SUCCESS) or a 28 (ENOSPC: No space left on device), i.e. SCSISTAT_QUEUE_FULL. The test should really be if (res == 0) {}...
As no data was added to a buffer, does it really need any further action?
You might observe in my reply to @frwbr above that a call to virtqueue_kick() just calls virtqueue_kick_prepare() and so a virtqueue_notify(vq) won't happen unless the former returns TRUE.
We could call a virtqueue_notify(vq), but the device won't find anything there.
My thoughts were that it would just consume compute resources and time.
if (res >= 0) {
notify = virtqueue_kick_prepare(adaptExt->vq[QueueNumber]);
To be clear, in this case, we're only //potentially// kicking the underlying device if we're successful in enqueueing work.
This function also has a membarrier, which we are now doing under the lock, vs outside the lock in the previous code.
I certainly love it when stuff is faster, but it isn't immediately clear to me why this code would be more performant than the existing code. Is the idea that we're changing the amount of kicks happening for the better? Would love to know your observations here
This was the breaking change following the v208 package.
This is where I'm now focusing my efforts per update in #756...
The underlying issue does certainly seem to be about managing virtqueue notifications...
@@ -185,6 +191,7 @@ DeviceReset(

ENTER_FN();
if (adaptExt->dump_mode) {
EXIT_FN();
ditto on my comments about indentation if we can get away with it
Yeah, let's wait to see what Vadim says.
@@ -384,46 +393,48 @@ ENTER_FN();
SP_INTERNAL_ADAPTER_ERROR,
__LINE__);
RhelDbgPrint(TRACE_LEVEL_FATAL, " CANNOT READ PCI CONFIGURATION SPACE %d\n", pci_cfg_len);
EXIT_FN();
This whole part should be a separate PR. Not that it's bad, just seems orthogonal?
The issue was that in the trace I could see entry into the functions, but when leaving early, it indicated a potential problem, because the corresponding exit trace was missing.
Jon, if you were actually talking about the section below, it was because it was in orphaned braces.
Thanks for the contribution and review Jon. I'll take the housekeeping on board and I'm happy to share my stat collector scripts and results too. Give me a couple of days - I have some other priorities to deal with...
Thanks a lot @benyamin-codez for your efforts and for submitting this PR!
If I understand correctly this PR combines a possible fix for the vioscsi "Reset to device \Device\RaidPortN was issued" messages + IO lockups with refactoring to improve performance and WPP instrumentation (it was already suggested to split up the PR in smaller ones). Personally I'm most interested in the "Reset to device" fix (though performance improvements are of course also nice). So I tested the reproducer on current master and with this PR applied:
- On 54442f07 the reproducer produces the vioscsi warnings pretty reliably (>=9 out of 10 runs)
- With this PR applied on top (benyamin-codez@976c325) the reproducer has not produced vioscsi warnings so far!
So this looks promising, but I'd like to understand the why -- I've also added a comment inline.
Thanks again!
Thanks for dropping your post above, Friedrich, and for taking the time to build and test it out. For those interested, check out my inline reply.
@vrozenfe @JonKohler @sb-ntnx @frwbr @foxmox Please bear in mind that despite fixing the issue mentioned in #756 the performance is not reliable. Given I've got limited time over the next few days, and I'm not aware of the timing for the next release, would you prefer I spend the time I do have on submitting the new SendSRB fix PR for review or hunt down the issue affecting performance first? Personally, I'm happy to continue working on it here until it's ready for prime time. One reason for this position is that I don't think it is as stable as the 100.85.104.20800 version. The extra performance the fix permits has affected stability and reliability.
Some pictures might help explain what I mean... Each of the following show the end point of multiple runs of my stat collector. There should be 6 distinct peaks in I/O for each test run; the same for CPU... All are using
This one is running just the SRB fix (this one doesn't show an example of CPU utilisation - my bad 8^d ):
This one is version 100.85.104.20800:
This one was a test running the code in this draft PR, but with the optimisations for
This one is the same, but is running two HBAs with
This last one might appear more reliable, but this is because it is only operating with throughput of 9 to 17% of the others which are all single HBA with
Also, re the stat collector scripts, can someone suggest a suitable place in the tree, e.g. /vioscsi/stat-collector or /Tools/stat-collector or similar? I'll then do a PR for them too. Let me know...
Fail on Disk Stress (LOGO). Not surprised. 8^(
element = &adaptExt->processing_srbs[QueueNumber];
VioStorVQLock(DeviceExtension, MessageId, &LockHandle, FALSE);
if (virtqueue_add_buf(vq,
&srbExt->sg[0],
srbExt->out, srbExt->in,
&srbExt->vbr, va, pa) >= 0) {
res = virtqueue_add_buf(vq,
&srbExt->sg[0],
srbExt->out, srbExt->in,
&srbExt->vbr, va, pa);

if (res >= 0) {
notify = virtqueue_kick_prepare(vq);
element = &adaptExt->processing_srbs[QueueNumber];
InsertTailList(&element->srb_list, &srbExt->vbr.list_entry);
element->srb_cnt++;
VioStorVQUnlock(DeviceExtension, MessageId, &LockHandle, FALSE);
#ifdef DBG
InterlockedIncrement((LONG volatile*)&adaptExt->inqueue_cnt);
#endif
result = TRUE;
}
Moving the element = &adaptExt->processing_srbs[QueueNumber]; to after the virtqueue_kick_prepare() might be an issue here... I will refactor and retest...
This is the same for [vioscsi]... I will check there first.
Already done in my present [vioscsi] WIP... Will look at [viostor] once [vioscsi] is stable.
//
// Moved outside conditional so (res != VQ_ADD_BUFFER_SUCCESS) can access it for RemoveTailList(&element->srb_list)
element = &adaptExt->processing_srbs[index];
//
// res, i.e. virtqueue_add_buf() will only
// return 0 (SUCCESS) or 28 (ENOSPC), i.e. No space left on device.
// Therefore setting comparator from >= to ==
// Also defined VQ_ADD_BUFFER_SUCCESS (as 0) in helper.h
if (res == VQ_ADD_BUFFER_SUCCESS) {
    //
    // Kicker returns like this:
    //
    // if (_vq->vdev->event_suppression_enabled) {
    //     return wrap_around || (bool)vring_need_event(vring_avail_event(&vq->vring), new, old);
    // } else {
    //     return !(vq->vring.used->flags & VIRTQ_USED_F_NO_NOTIFY);
    // }
    //
    /* The Host uses this in used->flags to advise the Guest: don't kick me when
     * you add a buffer. It's unreliable, so it's simply an optimization. Guest
     * will still kick if it's out of buffers.
     *
     #define VIRTQ_USED_F_NO_NOTIFY 1
     */
    // So ....
    //
    notify = virtqueue_kick_prepare(adaptExt->vq[QueueNumber]);
    RhelDbgPrint(TRACE_LEVEL_RESERVED8, " After KickPrep Notify State : %d \n", notify);
    //
    // Moved outside conditional so else can access
    //element = &adaptExt->processing_srbs[index];
    InsertTailList(&element->srb_list, &srbExt->list_entry);
    element->srb_cnt++;
} else {
    //
    // So to reiterate:
    // This only gets hit if virtqueue_add_buf() returns 28, i.e. the buffer is full.
    //
    // Put the DbgPrint above rest to ensure we get problem data. Also split the long line.
    RhelDbgPrint(TRACE_LEVEL_FATAL, " Could not put an SRB into a VQ, so complete it with SRB_STATUS_BUSY. QueueNumber = %d, SRB = 0x%p, Lun = %d, TimeOut = %d.\n"\
                 , QueueNumber, srbExt->Srb, SRB_LUN(Srb), Srb->TimeOutValue);
    //
    // Adding equivalent for InsertTailList() as the
    // working version 100.85.104.20800 driver
    // had ExInterlockedInsertHeadList() here
    // ---> NOT required
    //RemoveTailList(&element->srb_list);
    ScsiStatus = SCSISTAT_QUEUE_FULL;
    SRB_SET_SRB_STATUS(Srb, SRB_STATUS_BUSY);
    SRB_SET_SCSI_STATUS(Srb, ScsiStatus);
    StorPortBusy(DeviceExtension, 10);
    CompleteRequest(DeviceExtension, Srb);
    //
    // Don't think these are necessary as the buffer is full.
    //notify = TRUE;
    //virtqueue_notify(adaptExt->vq[QueueNumber]);
    //
    // ...but do we need this...? CompleteRequest() takes
    // the SRB and finalises it, so best to wipe ours methinks...
    //
    srbExt = NULL;
}
FWIW, I have run my reproducer against a vioscsi build with this PR applied on top of 54442f0 a few more times, and still haven't seen the vioscsi "Reset to device \Device\RaidPortN was issued" messages + IO lockups from #756 so far. I have also applied only the hunk I suspect to be responsible (diff to 54442f0 is here) and have not seen the issue either.
IMHO, if this PR indeed contains a fix for the IO lockup issues from #756 (this would be the "SendSRB fix" you mention?), it would be amazing to get the fix in soon. So, if time permits, it would be great if you could open an isolated PR for the SendSRB fix. If the fix is indeed in this hunk, I'd expect the PR to be relatively small and thus easier to review than a combination of multiple things. Also, I'd hope the fix wouldn't have a huge performance impact (but this is just hope speaking, I haven't actually run any performance tests myself). Though the other improvements to performance and tracing from this PR also sound promising, in comparison to the IO lockups they seem to be of lower priority. But this is just my opinion, YMMV :)
Can you elaborate what you mean by "stability and reliability" here? Do you mean the performance is not as uniform over time (more "bursty")?
Sorry, I'm not sure what I should be looking at in the screenshots and how to compare the different results. Could you explain in more detail what setup was tested in which scenario (maybe with the code git revision and QEMU command line), and share the diskspd outputs?
Which version of the code fails, and in which sense does it fail? Again, many thanks for your work!
Agreed - stability and lockup style bugs must come before optimization type work. If we can isolate that, let's peel that to a separate very skinny commit/PR and get that cooking first
@JonKohler @frwbr, I agree with you both. I think we are all on the same page. I've got some answers to your questions Friedrich. I'm pretty sure I'm on a path now to getting it sorted. I've got to step away for a couple of hours, but I'll post here as soon as I can. In the meantime, a picture to show what has to be compared. I'm sure you will agree that makes stats pretty useless unless you do relatively short stints over a very long period.
Great! I myself will be away from computers for ~2 weeks now, but will be happy to review/test when I'm back. Again, thanks for your work!
For those watching: I've spent a couple more hours tracing this through, and it looks like we have at least three issues. First, the issue of notifications. That seems to be solved for the most part with fixes already in this PR. Second, there are some not insignificant issues with spinlock management. This is what I am focused on at the moment. If it wasn't obvious to anyone, we only issue kvm-guest-drivers-windows/vioscsi/helper.c Lines 527 to 541 in 54442f0
That's an issue, but interestingly, if we fix that, we can reliably reproduce kvm-guest-drivers-windows/vioscsi/helper.c Lines 63 to 64 in 54442f0
I should mention that using MSI-X based spinlocks doesn't appear to cause any errors - noting I haven't properly stress tested them yet - but graphical representations of the I/O they produce look problematic and erratic. They also didn't immediately solve the irregularities in performance. We use The third issue looks to be related to the use of virtqueue assignment. It looks like there are several issues here which I am slowly unpacking... It appears in some places it is possible that a virtqueue might be replaced mid-process upon entering a new function. This could be another cause of The work continues in a limited fashion for the next few days...
For those watching: I think I'm almost done and hope to provide a report here in the next 12 to 24 hours. I'll then split up the commits into new PRs. The performance reliability issue turned out to be a bit of a red herring. Proper preparation of the
A somewhat belated progress report:
So I've had trouble reproducing faults with
It suggests there might be one or more race conditions at play when dealing with zeroed files / data. This could also potentially be a source for the
In relation to my fixes: apart from the initial notification fix in
It's noteworthy that AFAICT, we really only do DPC over MSI-X now. It would seem that a call to use an
I also have a large amount of refactoring, much of which is perhaps of little benefit, but might be of some use. Some of the useful stuff is probably my tracing and instrumentation improvements. I will probably drop a branch on my fork with all my edits if anyone is interested.
At the moment I am rebasing this work against master to produce smaller, targeted commits for new PRs. I'm pretty sure I also fixed a couple of orphaned sections from previous squashings, which this work will find (provided I didn't create the orphans from my own refactoring).
I did check virtqueue assignments and also MSI-X CPU affinity, and found them to be working correctly in a wide variety of operational scenarios. In my refactored solution, I did implement
I'm just making one last revisit to
To that end, can someone please confirm from a trace what they get during
@JonKohler and @frwbr (and anyone else so inclined), are you able to help with this?
@vrozenfe, is there any reason we don't enable the following?
Hope to wrap this up in short order... 8^d
Technically, enabling STOR_PERF_DPC_REDIRECTION_CURRENT_CPU and STOR_PERF_OPTIMIZE_FOR_COMPLETION_DURING_STARTIO works quite well for storport miniport drivers and can improve performance a bit more. I tested it a bit on the viostor (virtio-blk) driver a long time ago but didn't add the relevant code to the virtio-scsi driver. In any case, to make the STOR_PERF_OPTIMIZE_FOR_COMPLETION_DURING_STARTIO code work we need to have some per-CPU list to keep SRBs, extend the StartIo routine a bit, and make sure that we complete those SRBs in the per-CPU queues in case of bus/lun reset. Regarding Dma32BitAddresses: on a real hardware HBA it should be a must to let the DMA engine work properly. I don't know if enabling or disabling this bit will have any implication for qemu virtio, I never tried to turn it off, but I know for sure that some HW vendors run this code on top of FPGA-implemented virtio adaptors, and this is the reason why we will keep it turned on :)
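As a rough illustration of the per-CPU bookkeeping described above, here is a toy sketch of tracking requests per CPU at submit time and completing all of them on reset. Everything in it is an assumption made for the example: struct demo_srb, struct demo_adapter, DEMO_MAX_CPUS and both functions are hypothetical, no StorPort API is used, and the locking a real driver would need around each list is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_CPUS 4       /* arbitrary size for the demo */

struct demo_srb {
    struct demo_srb *next;
    int completed;            /* set when the request is finished */
};

struct demo_adapter {
    struct demo_srb *per_cpu_head[DEMO_MAX_CPUS];
};

/* StartIo-side step: remember the request on the submitting CPU's list. */
static void demo_track(struct demo_adapter *a, unsigned cpu, struct demo_srb *srb)
{
    assert(cpu < DEMO_MAX_CPUS);
    srb->completed = 0;
    srb->next = a->per_cpu_head[cpu];
    a->per_cpu_head[cpu] = srb;
}

/* Reset-side step: complete everything still tracked on every CPU.
 * Returns the number of requests completed. */
static int demo_complete_all_on_reset(struct demo_adapter *a)
{
    int done = 0;

    for (unsigned cpu = 0; cpu < DEMO_MAX_CPUS; cpu++) {
        struct demo_srb *srb = a->per_cpu_head[cpu];
        while (srb != NULL) {
            struct demo_srb *next = srb->next;
            srb->next = NULL;
            srb->completed = 1;
            done++;
            srb = next;
        }
        a->per_cpu_head[cpu] = NULL;
    }
    return done;
}
```

The design point is that a reset has to walk every CPU's list, not only the list of the CPU handling the reset, otherwise requests tracked elsewhere would never be completed.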
Hi @benyamin-codez! Unfortunately I know very little about Windows internals, so can't comment on your observations w.r.t. the different types of spin locks.
This is quite interesting. I haven't run the reproducer with all-zero test files so far, only with random files. Do I understand correctly these faults are not fixed by the notifications patch?
Here is the trace:
--> VioScsiFindAdapter.
--> VioScsiWmiInitialize.
<-- VioScsiWmiInitialize.
--> InitHW.
MessageControl.TableSize = 6
MessageControl.FunctionMask = 0
MessageControl.MSIXEnable = 1
MessageTable = 1
PBATable = 2049
CapabilityID = 9, Next CapOffset = 84
CapabilityID = 9, Next CapOffset = 70
CapabilityID = 9, Next CapOffset = 60
CapabilityID = 9, Next CapOffset = 50
CapabilityID = 9, Next CapOffset = 40
msix_enabled = 1
<-- InitHW.
--> GetScsiConfig.
seg_max 254
num_queues 4
max_sectors 65535
cmd_per_lun 128
event_info_size 16
sense_size 96
cdb_size 32
max_channel 0
max_target 255
max_lun 16383
<-- GetScsiConfig.
--> SetGuestFeatures.
<-- SetGuestFeatures.
StorPortRegistryRead returned 0x0, Len = 0
NumberOfPhysicalBreaks 513
MaximumTransferLength 2097152
StorPortRegistryRead returned 0x0, Len = 0
StorPortRegistryRead returned 0x0, Len = 0
Queues 4 CPUs 4
breaks_number = 201 queue_depth = 100
StorPortGetUncachedExtension uncachedExtensionVa = FFFF9C89A0F49000 allocation size = 84096
Page-aligned area at FFFF9C89A0F49000, size = 49152
Pool area at FFFF9C89A0F55000, size = 30848
pmsg_affinity = 0000000000000000
pmsg_affinity = FFFF9C89A0EEB510 Status = 0
<-- VioScsiFindAdapter.
The VM has only one disk attached via SCSI, it is a raw disk and on the host side it is an LVM Logical Volume (
Regarding the max_sectors=65535 -- could this simply be the default value set by QEMU here?
Hope this helps. Happy to provide more data if needed.
Thanks @frwbr !
The patch works. When using an unpatched version, I found it more difficult to produce the fault with random data, so perhaps YMMV. Maybe try with test files prepared with
It is, and This is mostly performance related, as setting the correct NOPB ( The following table should be insightful:
In your NVMe scenario, you should have aligned values of So I've worked on this a bit and I have a new solution, in part it looks like this: EDIT:

ULONG nopb_candidate[3] = { 0 };
ULONG max_segments_b4_alignment;
//...
if (!adaptExt->dump_mode) {
    /* Allow user to override max_physical_breaks via reg key
     * [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device]
     * "PhysicalBreaks"={dword value here}
     *
     * *** This should be VIRTIO_MAX_SG - 1, approximated by the maximum number of memory pages (typ. 4KiB each) - 1 ***
     */
    if (VioScsiReadRegistryParameter(DeviceExtension, REGISTRY_MAX_PH_BREAKS, FIELD_OFFSET(ADAPTER_EXTENSION, max_physical_breaks))) {
        /* We +1 to convert to segments from NOPB */
        adaptExt->max_segments = adaptExt->max_physical_breaks + 1;
#if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_physical_breaks candidate was specified in the registry : %lu | max_segments : %lu \n",
                     adaptExt->max_physical_breaks, adaptExt->max_segments);
#endif
    } else {
        /* Grab the VirtIO reported maximum SEGMENTS value from the HBA and put it somewhere mutable */
        adaptExt->max_segments = adaptExt->scsi_config.seg_max;
#if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_physical_breaks candidate was NOT specified in the registry. We will attempt to derive the value...\n");
#endif
    }
    /* Use our maximum SEGMENTS value OR use PHYS_SEGMENTS... */
    nopb_candidate[1] = (adaptExt->indirect) ? (adaptExt->max_segments - 1) : (PHYS_SEGMENTS - 1);
#if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_physical_breaks candidate derived from MAX SEGMENTS : %lu \n", nopb_candidate[1]);
#endif
    /* Grab the VirtIO reported maximum SECTORS value from the HBA to start with */
    nopb_candidate[2] = (adaptExt->scsi_config.max_sectors * SECTOR_SIZE / PAGE_SIZE) - 1;
#if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_physical_breaks candidate derived from MAX SECTORS (QEMU/KVM hint) : %lu \n", nopb_candidate[2]);
#endif
    /* Start with a comparison of equality */
    if (nopb_candidate[1] == nopb_candidate[2]) {
        nopb_candidate[0] = nopb_candidate[1];
#if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " nopb_candidate[0] : init - the candidates were the same value : %lu \n", nopb_candidate[0]);
#endif
    } else if (nopb_candidate[2] > 0 && nopb_candidate[2] < (MAX_PHYS_SEGMENTS - 1)) {
        nopb_candidate[0] = nopb_candidate[2];
#if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " nopb_candidate[0] : init - the QEMU/KVM hint method (scsi_config.max_sectors) was used to select the candidate : %lu \n", nopb_candidate[0]);
#endif
    } else {
        /* Take the smallest candidate */
        nopb_candidate[0] = min((nopb_candidate[1]), (nopb_candidate[2]));
#if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " nopb_candidate[0] : init - the smallest candidate was selected : %lu \n", nopb_candidate[0]);
#endif
    }
    /* Check the value is within SG list bounds */
    nopb_candidate[0] = min(max(SCSI_MINIMUM_PHYSICAL_BREAKS, nopb_candidate[0]), (VIRTIO_MAX_SG - 1));
#if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " nopb_candidate[0] : within SG list bounds : %lu\n", nopb_candidate[0]);
#endif
    /* Check the value is within physical bounds */
    nopb_candidate[0] = min(max(SCSI_MINIMUM_PHYSICAL_BREAKS, nopb_candidate[0]), (MAX_PHYS_SEGMENTS - 1));
#if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " nopb_candidate[0] : within physical bounds : %lu\n", nopb_candidate[0]);
#endif
    /* Update max_segments for all cases */
    adaptExt->max_segments = nopb_candidate[0] + 1;
    max_segments_b4_alignment = adaptExt->max_segments;
    /* Do byte alignment (using integer division) if necessary */
    if (max_segments_b4_alignment > (PAGE_SIZE / SECTOR_SIZE)) {
        adaptExt->max_physical_breaks = (((max_segments_b4_alignment / (PAGE_SIZE / SECTOR_SIZE)) * (PAGE_SIZE / SECTOR_SIZE)) - 1);
        if (max_segments_b4_alignment != (adaptExt->max_physical_breaks + 1)) {
            adaptExt->max_segments = adaptExt->max_physical_breaks + 1;
        }
#if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE,
                     " Sector byte alignment : SECTOR_SIZE = %lu Bytes, PAGE_SIZE = %lu KiB, max_segments : original = %lu, aligned = %lu, max_physical_breaks : original = %lu, aligned = %lu \n",
                     SECTOR_SIZE, (PAGE_SIZE / 1024), max_segments_b4_alignment, adaptExt->max_segments, nopb_candidate[0], adaptExt->max_physical_breaks);
#endif
    }
}
ConfigInfo->NumberOfPhysicalBreaks = adaptExt->max_physical_breaks;
/* MaximumTransferLength should be calculated from segments not breaks... */
ConfigInfo->MaximumTransferLength = adaptExt->max_segments * PAGE_SIZE;
#if !defined(RUN_UNCHECKED)
RhelDbgPrint(TRACE_LEVEL_INFORMATION, " NumberOfSegments : %lu | NumberOfPhysicalBreaks : %lu | MaximumTransferLength : %lu Bytes (%lu KiB) \n",
             (ConfigInfo->NumberOfPhysicalBreaks + 1),
             ConfigInfo->NumberOfPhysicalBreaks,
             ConfigInfo->MaximumTransferLength,
             (ConfigInfo->MaximumTransferLength / 1024));
#endif

This is only part of the solution. The problem becomes that even if enabling per HBA registry settings by modding The .../queue/max_segments:168, .../queue/max_hw_sectors_kb:32767 devices are SATA3 HDDs I can only set NOPB for the lowest I am presently working on this but it might be a fundamental architectural barrier. I have a few suspects, like they all share the same @vrozenfe, can you provide a hint as to how to get each HBA instance to act separately, with its own memory allocation and NOPB settings...? I also have fixes for memory allocation, including for |
FYI there is a RH Jira issue reported internally, https://issues.redhat.com/browse/RHEL-56722, which I'm trying to fix soon. |
Per my comments in my last post above. Could the fact using a per HBA scope in Normally these would be Parameters\Device0, Parameters\Device1, and so on... We seem to have -1 instead... Thanks for the hints in your last post here. My observations:
From what I could tell, this seems to be already taken care of by STOR_PERF_DPC_REDIRECTION_CURRENT_CPU and
I found enabling this manually (StorPort does it automagically) resulted in a loss of performance. I came up with this:

/* NOTE: When unset we get -5k IOPS +30us latency (best case)...! */
ConfigInfo->Master = TRUE;        // +7k IOPS -50us latency
ConfigInfo->ScatterGather = TRUE; // +12k IOPS -75us latency
/* Allow user to force use of Dma32BitAddresses via reg key
 * [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device]
 * "UseDma32BitAddresses"={any dword value here - the value is ignored}
 *
 * WARNING: Manually setting this increases latency and reduces IOPS
 *
 * NOTE: StorPort normally sets this to TRUE anyway.
 *       So let StorPort do it for maximum performance.
 *       Only provided in the event StorPort does not enable the feature and it is required.
 */
if (VioScsiReadRegistryParameter(DeviceExtension, REGISTRY_USE_DMA32BITADDRESSES, FIELD_OFFSET(ADAPTER_EXTENSION, use_Dma32BitAddresses))) {
#if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " REGISTRY_USE_DMA32BITADDRESSES was FOUND in the registry. We will set ConfigInfo->Dma32BitAddresses to TRUE. \n");
#endif
    ConfigInfo->Dma32BitAddresses = TRUE; // -15k IOPS +100us latency (worst case)
} else {
#if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " REGISTRY_USE_DMA32BITADDRESSES was NOT FOUND in the registry. We will let StorPort manage the Dma32BitAddresses setting. \n");
#endif
}
/*
 * WARNING: Do not set this.
 *          All of these options increase latency and reduce IOPS:
 *
ConfigInfo->DmaWidth = 0; //<-- This should be zero as initialised by StorPort
ConfigInfo->DmaWidth = Width32Bits; //<-- This should be zero as initialised by StorPort
ConfigInfo->DmaWidth = Width64Bits; //<-- This should be zero as initialised by StorPort
*/

The I also implemented some tracing with I'm not sure what this would mean for packaging, but the performance gains of the above are not negligible. I've seen improvements with 4KiB random I/O of up to 50k IOPS and latency reduced by up to 250us. In any case, my plan is to sort out the per HBA stuff (one way or the other), then I'll drop the whole lot in a branch of my fork to split into new branches and make PRs from. In the interim, I hope to split the notification fix in |
@benyamin-codez Best regards, |
To not rely on the PhysicalBreaks registry entry will require QEMU changes. By the way, I found the Parameters\Device-1 by issuing a |
Considering further where a good spot in the tree might be, perhaps one of the following would work better: The script can then be called This might also be a good location for any |
My bad on the rename. Maybe kill those checks.... 8^d |
A PR for the SendSRB notification fix has been raised: #1150 |
A PR for two DPC fixes has been raised: #1152 |
A PR for minor SendSRB refactoring has been raised: #1153 |
For anyone concerned that performance has diminished from earlier tests published in other places, please be aware that the results above were on a shared disk with other simultaneous I/O occurring. What is important is the comparison with the v208 (100.85.104.20800) driver. Here is a comparison on a dedicated NVMe disk: It is also important to realise that my NOPB references are likely in error. I will explain this in the next post here. With respect to unchecked vs. checked build comparisons, the v208 version of the driver has very little tracing during I/O operations. My checked build has about 50% more tracing at the start and much, much more during any I/O. |
Regarding It's noteworthy that both the SCSI Port and StorPort documentation is incorrect wrt NOPB:

Specifies the maximum number of breaks between address ranges that a data buffer can have if the HBA supports scatter/gather. In other words, the number of scatter/gather lists minus one. By default, the value of this member is SP_UNINITIALIZED_VALUE, which indicates the HBA can support an unlimited number of physical discontinuities. If the port driver sets a value for this member, the miniport driver can adjust the value lower but no higher. If this member is SP_UNINITIALIZED_VALUE, the miniport driver must reset this member according to the HBA's scatter/gather capacity, with zero representing no scatter/gather support.

When in fact SCSI Port needs:

Maximum number of physical pages the storage adapter can manage in a single transfer (in other words, the extent of its scatter/gather support). By default, the value of this member is 0x11. The miniport driver must reset this member according to the storage adapter's capability.

When in fact StorPort needs:

This thread is also telling: microsoft/Windows-driver-samples#6 (comment)

So whilst the documentation says one thing, the off-by-one behaviour required by StorPort is to maintain backwards compatibility. Some more links (plus how this relates to In my refactoring of this part of Whilst under the mistaken belief that NOPB should be what the documentation says, I worked around this by using I've made the following changes:
#define VIRTIO_MAX_SG (PHYS_SEGMENTS_LIMIT+1)
#define REGISTRY_MAX_PH_SEGMENTS "MaxPhysicalSegments"
#define REGISTRY_FACTOR8_REMAP "PerformFactor8Remap"
#define VIRTIO_SCSI_QUEUE_LAST (MAX_CPU - VIRTIO_SCSI_REQUEST_QUEUE_0)

..and my WIP now uses:

if (!adaptExt->dump_mode) {
    /* Allow user to override max_segments via reg key
     * [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device]
     * "MaxPhysicalSegments"={dword value here}
     * OR the legacy value:
     * "PhysicalBreaks"={dword value here}
     *
     * ATTENTION: This should be the maximum number of memory pages (typ. 4KiB each) in a transfer
     *            Equivalent to any of the following:
     *            NumberOfPhysicalBreaks - 1 (NOPB includes known off-by-one error)
     *            VIRTIO_MAX_SG - 1
     *            MaximumSGList - 1 (SCSI Port legacy value)
     */
    if ((VioScsiReadRegistryParameter(DeviceExtension, REGISTRY_MAX_PH_BREAKS, FIELD_OFFSET(ADAPTER_EXTENSION, max_segments))) ||
        (VioScsiReadRegistryParameter(DeviceExtension, REGISTRY_MAX_PH_SEGMENTS, FIELD_OFFSET(ADAPTER_EXTENSION, max_segments)))) {
        /* Grab the maximum SEGMENTS value from the registry */
#if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_segments candidate was specified in the registry : %lu \n", adaptExt->max_segments);
#endif
    } else {
        /* Grab the VirtIO reported maximum SEGMENTS value from the HBA and put it somewhere mutable */
        adaptExt->max_segments = adaptExt->scsi_config.seg_max;
#if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_segments candidate was NOT specified in the registry. We will attempt to derive the value...\n");
#endif
    }
    /* Use our maximum SEGMENTS value OR use PHYS_SEGMENTS... */
    if (adaptExt->indirect) {
        max_segs_candidate[1] = adaptExt->max_segments;
#if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_segments candidate derived from MAX SEGMENTS (as reported by QEMU/KVM) : %lu \n", max_segs_candidate[1]);
#endif
    } else {
        max_segs_candidate[1] = PHYS_SEGMENTS;
#if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_segments candidate derived from PHYS_SEGMENTS : %lu \n", max_segs_candidate[1]);
#endif
    }
    /* Grab the VirtIO reported maximum SECTORS value from the HBA to start with */
    max_segs_candidate[2] = (adaptExt->scsi_config.max_sectors * SECTOR_SIZE) / PAGE_SIZE;
#if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_segments candidate derived from MAX SECTORS (QEMU/KVM hint per VirtIO standard) : %lu \n", max_segs_candidate[2]);
#endif
    /* Choose the best candidate... */
    if (max_segs_candidate[1] == max_segs_candidate[2]) {
        /* Start with a comparison of equality */
        max_segs_candidate[0] = max_segs_candidate[1];
#if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_segs_candidate[0] : init - the candidates were the same value : %lu \n", max_segs_candidate[0]);
#endif
    } else if ((max_segs_candidate[2] > 0) && (max_segs_candidate[2] < PHYS_SEGMENTS_LIMIT)) {
        /* Use the value derived from the QEMU/KVM hint if it is below the PHYS_SEGMENTS_LIMIT */
        max_segs_candidate[0] = max_segs_candidate[2];
#if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_segs_candidate[0] : init - the QEMU/KVM hint method (scsi_config.max_sectors) was used to select the candidate : %lu \n", max_segs_candidate[0]);
#endif
    } else {
        /* Take the smallest candidate */
        max_segs_candidate[0] = min((max_segs_candidate[1]), (max_segs_candidate[2]));
#if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_segs_candidate[0] : init - the smallest candidate was selected : %lu \n", max_segs_candidate[0]);
#endif
    }
    /* Check the value is within SG list bounds */
    max_segs_candidate[0] = min(max(SCSI_MINIMUM_PHYSICAL_BREAKS, max_segs_candidate[0]), (VIRTIO_MAX_SG - 1));
#if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_segs_candidate[0] : within SG list bounds : %lu\n", max_segs_candidate[0]);
#endif
    /* Check the value is within physical bounds */
    max_segs_candidate[0] = min(max(SCSI_MINIMUM_PHYSICAL_BREAKS, max_segs_candidate[0]), PHYS_SEGMENTS_LIMIT);
#if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_segs_candidate[0] : within physical bounds : %lu\n", max_segs_candidate[0]);
#endif
    /* Update max_segments for all cases */
    adaptExt->max_segments = max_segs_candidate[0];
    max_segments_b4_alignment = adaptExt->max_segments;
    /* Factor 8 Remapping - may increase performance in certain settings
     *
     * Allow user to specify factor8_remap via reg key
     * [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device]
     * "PerformFactor8Remap"={any dword value here - the value is ignored}
     *
     * NOTE: This is generally only required when the NumberOfPhysicalBreaks (NOPB) value
     *       reported to StorPort is incorrect (off-by-one) and the number of segments is
     *       not a multiple of 8 (where max_segments modulo 8 would not be equal to zero).
     */
    if (VioScsiReadRegistryParameter(DeviceExtension, REGISTRY_FACTOR8_REMAP, FIELD_OFFSET(ADAPTER_EXTENSION, factor8_remap))) {
#if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " REGISTRY_FACTOR8_REMAP was FOUND in the registry. We will remap the max_segments value. \n");
#endif
        if (max_segments_b4_alignment > 8) {
            adaptExt->max_segments = ((max_segments_b4_alignment / 8) * 8);
        } else {
            // We should never hit this because adaptExt->max_segments >= SCSI_MINIMUM_PHYSICAL_BREAKS (16)
            adaptExt->max_segments = 8;
        }
#if !defined(RUN_UNCHECKED)
        if (max_segments_b4_alignment != (adaptExt->max_segments)) {
            RhelDbgPrint(TRACE_LEVEL_INFORMATION, " The max_segments value was remapped using a factor of 8... \n");
            RhelDbgPrint(TRACE_LEVEL_VERBOSE,
                         " Factor 8 Remapping : max_segments : original = %lu, aligned = %lu \n", max_segments_b4_alignment, adaptExt->max_segments);
        } else {
            RhelDbgPrint(TRACE_LEVEL_INFORMATION, " The max_segments value did not require factor 8 remapping. \n");
        }
#endif
    } else {
#if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " REGISTRY_FACTOR8_REMAP was NOT FOUND in the registry. We will not remap the max_segments value. \n");
#endif
    }
}
/* Here we enforce legacy off-by-one NumberOfPhysicalBreaks (NOPB) behaviour for StorPort.
 * This behaviour was retained in StorPort to maintain backwards compatibility.
 * This is analogous to the legacy MaximumSGList parameter in the SCSI Port driver.
 * Where:
 *     MaximumSGList = ((MAX_BLOCK_SIZE)/PAGE_SIZE) + 1
 * The default x86/x64 values being:
 *     MaximumSGList = (64KiB/4KiB) + 1 = 16 + 1 = 17 (0x11)
 * The MAX_BLOCK_SIZE limit is no longer 64KiB, but 2048KiB (2MiB):
 *     NOPB or MaximumSGList = (2048KiB/4KiB) + 1 = 512 + 1 = 513 (0x201)
 *
 * ATTENTION: The MS NOPB documentation for both the SCSI Port and StorPort drivers is incorrect.
 *
 * As max_segments = MAX_BLOCK_SIZE/PAGE_SIZE we use:
 */
ConfigInfo->NumberOfPhysicalBreaks = adaptExt->max_segments + 1;
/* Here we use the efficient single step calculation for MaximumTransferLength
 *
 * The alternative would be:
 *     ConfigInfo->MaximumTransferLength = adaptExt->max_segments;
 *     ConfigInfo->MaximumTransferLength <<= PAGE_SHIFT;
 * ...where #define PAGE_SHIFT 12L
 *
 */
ConfigInfo->MaximumTransferLength = adaptExt->max_segments * PAGE_SIZE;

Now returning to working on Parameter\Device(d) registry reads... |
@benyamin-codez ConfigInfo->FeatureSupport |= STOR_ADAPTER_DMA_V3_PREFERRED; and check if it makes any difference? Thanks, |
Sounds good.
|
Thanks for getting back to me regarding a place in the tree for scripts to live. I see both @kostyanf14 and @YanVugenfirer are CODEOWNERS for the |
Thanks for having a look Vadim. Can you please confirm what it is that you want me to check...? I can say that setting
Did you want me to check whether the performance impact of setting Also thought it worthwhile to share what I now have for this part. Given the default is to use

/* NOTE: When unset we get -5k IOPS +30us latency (best case)...! */
ConfigInfo->Master = TRUE;        // +7k IOPS -50us latency
ConfigInfo->ScatterGather = TRUE; // +12k IOPS -75us latency
/* Allow user to restrict driver to Dma32BitAddresses ONLY via reg key
 * [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device]
 * "OnlyDma32BitAddresses"={any dword value here - the value is ignored}
 *
 * ATTENTION: StorPort normally sets ConfigInfo->Dma32BitAddresses to TRUE.
 *            Recommended to let StorPort manage this setting for maximum performance.
 *
 * WARNING: Manually setting ConfigInfo->Dma32BitAddresses has a negative impact on
 *          performance, increasing latency and reducing IOPS (up to -15k IOPS +100us latency)
 *          Setting this reg key disables Dma64BitAddresses.
 */
if (VioScsiReadRegistryParameter(DeviceExtension, REGISTRY_ONLY_DMA32BITADDRESSES, FIELD_OFFSET(ADAPTER_EXTENSION, OnlyDma32BitAddresses))) {
#if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " REGISTRY_ONLY_DMA32BITADDRESSES was FOUND in the registry. We will set ConfigInfo->Dma32BitAddresses to TRUE. \n");
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " REGISTRY_ONLY_DMA32BITADDRESSES was FOUND in the registry. We will set ConfigInfo->Dma64BitAddresses to 0. \n");
#endif
    ConfigInfo->Dma32BitAddresses = TRUE;
    ConfigInfo->Dma64BitAddresses = 0;
} else {
#if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " REGISTRY_ONLY_DMA32BITADDRESSES was NOT FOUND in the registry. We will let StorPort manage the Dma32BitAddresses setting. \n");
#endif
}
/* WARNING: Setting ConfigInfo->DmaWidth increases latency and reduces IOPS.
 *          StorPort sets this to Width8Bits (0x0) but ignores it when ConfigInfo->Master = TRUE
 *
 * ConfigInfo->DmaWidth = Width32Bits; //<-- This should be zero (Width8Bits) as initialised by StorPort
 */
#if !defined(RUN_UNCHECKED)
RhelDbgPrint(TRACE_LEVEL_VERBOSE, " StorPort : ConfigInfo->Master : %s \n", (ConfigInfo->Master) ? "ON" : "OFF");
RhelDbgPrint(TRACE_LEVEL_VERBOSE, " StorPort : ConfigInfo->ScatterGather : %s \n", (ConfigInfo->ScatterGather) ? "ON" : "OFF");
RhelDbgPrint(TRACE_LEVEL_VERBOSE, " StorPort : ConfigInfo->Dma32BitAddresses : %s \n", (ConfigInfo->Dma32BitAddresses) ? "ON" : "OFF");
RhelDbgPrint(TRACE_LEVEL_VERBOSE, " StorPort : ConfigInfo->DmaWidth : %lu \n", ConfigInfo->DmaWidth);
#endif
#if defined(NTDDI_WIN10_VB) && (NTDDI_VERSION >= NTDDI_WIN10_VB)
ConfigInfo->DmaAddressWidth = 64; // <-- Still yet to test on Win11 target...
#if !defined(RUN_UNCHECKED)
/* DmaAddressWidth is a number, so print it with an integer format rather than %s */
RhelDbgPrint(TRACE_LEVEL_VERBOSE, " ConfigInfo->DmaAddressWidth : %lu \n", (ULONG)ConfigInfo->DmaAddressWidth);
#endif
#else
#if !defined(RUN_UNCHECKED)
RhelDbgPrint(TRACE_LEVEL_VERBOSE, " ConfigInfo->DmaAddressWidth is NOT supported in this version of the driver. \n");
#endif
#endif
if (ConfigInfo->Dma64BitAddresses == SCSI_DMA64_SYSTEM_SUPPORTED) {
    ConfigInfo->Dma64BitAddresses = SCSI_DMA64_MINIPORT_FULL64BIT_SUPPORTED;
} |
Yes. Ideally we need to check if toggling Dma32BitAddresses has an effect on both x86 and x64 platforms (and probably even on arm64) with and without STOR_ADAPTER_DMA_V3_PREFERRED enabled. But it is going to be quite a massive test case, so I will try to run it by myself. All the best, |
....
I think similar to https://github.com/virtio-win/kvm-guest-drivers-windows/tree/master/NetKVM/DebugTools - put the scripts under DebugTools in /vioscsi/ |
Thanks for the clarification, Vadim. That will indeed be quite the undertaking... As for x64, I ran a few more tests on my Win10 target but with a couple more variables as follows: Test was 4KiB random, 32 queues, 16 threads on 4KiB segments with 2GB test file (random fill). PERF = STOR_PERF_DPC_REDIRECTION_CURRENT_CPU + STOR_PERF_OPTIMIZE_FOR_COMPLETION_DURING_STARTIO
Despite the effect of STOR_ADAPTER_DMA_V3_PREFERRED not being wholly consistent (and the differences are somewhat interesting), it would appear that letting StorPort manage the Dma32BitAddresses setting (which it sets to TRUE anyway) is still the best catch-all option. I note the latency values are higher than expected due to additional tracing. Best regards, |
@benyamin-codez I hope you're doing well! I wanted to ask if this PR can theoretically fix issue 907. Additionally, is there a plan for when this PR will be merged into master? Thank you for your help! Best regards, |
@lixianming19951001 |
A PR to reduce spinlock management complexity has been raised: #1175 |
A PR to improve tracing capability has been raised: #1176 |
@benyamin-codez Do you want to close this PR or you have some more changes? |
I have a few more PRs to raise. I'll try to drop a list here in the next few hours to seek input as to priorities. I can then update that list with PR numbers as I raise them. I'm happy to close this PR if those comments can still be made, which I believe is the case, yes...? |
@benyamin-codez sure, thanks! |
Fixes regression in SendSRB of [vioscsi]
Related issues:
#756
#623
#907
...
[likely others]
Regression was introduced in #684 between 7dc052d and fdf56dd.
Specifically due to commits f1338bb and fdf56dd (considered together).
PR includes:
The WPP improvements include:
Look out for 1x TODO + 2x FIXME... RFC please.
Tested on:
QEMU emulator version 9.0.2 (pve-qemu-kvm_9.0.2-2) ONLY
Linux 6.8.12-1-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-1 (2024-08-05T16:17Z) x86_64 GNU/Linux
iothread=0)
threads ONLY - UNTESTED on io_uring or native AIO (suspect this will fix issues here)
raw and qcow) ONLY, untested on networked backings
YMMV with other platforms and AIO implementations (but I suspect they will be fine).
Performance is significantly higher, even when compared to version 100.85.104.20800 in virtio-win package 0.1.208.
I suspect this is mainly because this fix unlocks other improvements made when the regression was introduced.
Occasional throughput of 12+GB/s was achieved in guest (64KiB blocks, 32 outstanding queues and 16 threads).
Sequential reads performed at just below the backing storage max throughput at 7GB/s (1MiB blocks, 8 outstanding queues and 1 thread). Random reads came in at 1.1GB/s, 270K IOPS at 1850 µs latency (4KiB blocks, 32 outstanding queues and 16 threads). Write performance was approximately 93% of read performance.
I tested on a variety of local storage media but the above numbers are for a 7.3GB/s Read / 6GB/s Write 1M IOPS R/W NVMe SSD. So throughput was as expected but IOPS were only ~27% of the rated figure.... It will be interesting to see what numbers come up on a decently sized NVMe RAID0/10 JBOD...! Certainly beware of running on old crusty spindles..! 8^d
It is also worth mentioning that the ETW overhead when running an active trace with all flags set is about 8%.
Freedom for Windows guests once held captive...! 8^D
cc: @vrozenfe @JonKohler @sb-ntnx @frwbr @foxmox
Related external issues (at least confounded by this regression):
https://bugzilla.proxmox.com/show_bug.cgi?id=4295
https://bugzilla.kernel.org/show_bug.cgi?id=199727
...
[likely others]