HPO 659: Mixed read/write with MD5 checking halts all I/O on all MetalLB IPs and does not recover unless the noobaa pods are restarted #6934
Comments
I need to amend the discovery of this defect. It has nothing to do with CNSA mmshutdown on a Scale core pod. The trigger is an S3 service with MD5 enabled and a cosbench workload with mixed read and write while doing hashCheck=true.
The cosbench workload fails. When it fails, a manual s3 cp times out, while s3 ls still works.
Can we make this a sev 1 defect? I think this is a critical problem for us. Could this be a noobaa DB deadlock? I don't know the commands to check for a noobaa DB deadlock.
@MonicaLemay
Curl works when the condition exists; the IPs are reachable.
Also, s3 ls works and will list the buckets. s3 cp times out, so no PUT works.
A few observations:
I feel this stale file handle error is expected, but maybe I'm missing something?
@romayalon Thanks Romy for the comment. I'm not as worried about the stale file handle and the cosbench slowdown. What concerns me the most is that after I hit this condition and kill the cosbench workload, I cannot do s3 cp. I can wait 12 hours and still cannot s3 cp a file. I have to restart the noobaa pods; after I restart them, I can do s3 cp.
@MonicaLemay So can you please check the cp command connection details again (specifically the endpoint address) and check whether the requests were received in the endpoint pods? A higher debug level is probably needed here.
I did some follow-up investigation that Romy asked for. In the process, I have refined the trigger.
3. Cosbench with hashCheck=true AND files of 1-2 GB causes s3 cp of a file to time out. This is the error that this defect needs to pursue. The best xml file to reproduce this quickly is:
One test was done with log level nsfs. Noobaa diagnose and must-gather were collected and uploaded to our Box note.
The hang occurred around
Wed Apr 13 15:36:10 EDT 2022
NOW LOG LEVEL ALL
Now try to s3 cp a file
In all cases, oc logs does not show any errors.
@MonicaLemay This is a great elimination! I probably missed it above somewhere - how much time, or which actions, does it take to recover from this situation? A few suggestions:
- When the condition occurs, s3 mb and rb work. Only PUT and GET fail to the 2 IPs that were doing cosbench simultaneous read/write with hashCheck=true.
I can provide IP addresses for the storage cluster, OpenShift cluster, and application nodes if the noobaa team would like to debug on my cluster.
@MonicaLemay Thanks!
One more comment - I cannot recreate this with warp. I just did a warp run and did not see the problem. Warp and cosbench have different ways of doing tests. I think it is an important data point, but not enough to blame cosbench as defective yet. Please let me know if you need to meet with me; I can clarify all of your questions very quickly.
Hi @MonicaLemay This is what I found regarding good.xml -
And the same for rand_read.xml - I see a SlowDown error:
@MonicaLemay Do we use any custom env overrides for
There is a log print of the semaphore and buffer pool initial values here, so we can check the actual values used.
@romayalon I see we left the upload_object flow with a non-accurate update to the semaphore, because we don't have full visibility into the incoming HTTP request buffers being allocated by the underlying nodejs http library. So in our code we surround the upload with just a single 8 MB acquired from the semaphore, on the assumption that this represents a rough estimate of the memory we use during the flow, but we know that nodejs streams can take more memory while flowing.
I would try to create a stress test case that we can run locally and see if the same effect can be observed. Of course the root cause might still be elsewhere.
Here is where we acquire the semaphore on upload - noobaa-core/src/sdk/namespace_fs.js Lines 687 to 692 in f50464d
We might also want to look for the timeout warning prints that should give us the stack trace of the caller more accurately (we added those when we were debugging read flow semaphore cases) - noobaa-core/src/util/semaphore.js Lines 84 to 88 in f50464d
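For illustration, here is a minimal sketch of the pattern described above - acquiring a fixed byte budget from a counting semaphore around the upload, and warning with the caller's stack trace if a waiter is stuck too long. This is not the actual noobaa-core Semaphore implementation; the names, the 8 MB budget constant, and the timeout value are assumptions for the sketch.

```js
// Sketch only -- not noobaa-core code. Illustrates "surround the upload with a
// fixed budget acquired from a semaphore" plus a stuck-waiter warning print.
const UPLOAD_BUDGET_BYTES = 8 * 1024 * 1024; // rough per-upload memory estimate
const WARN_TIMEOUT_MS = 10_000;

class CountingSemaphore {
    constructor(value) {
        this._value = value;
        this._waiters = [];
    }

    async acquire(count) {
        if (this._value >= count) {
            this._value -= count;
            return;
        }
        // Queue the waiter and warn (with its stack trace) if it is stuck too long.
        const stack = new Error('semaphore acquire is stuck').stack;
        const timer = setTimeout(() => console.warn('Semaphore timeout warning:', stack), WARN_TIMEOUT_MS);
        await new Promise(resolve => this._waiters.push({ count, resolve }));
        clearTimeout(timer);
    }

    release(count) {
        this._value += count;
        while (this._waiters.length && this._value >= this._waiters[0].count) {
            const waiter = this._waiters.shift();
            this._value -= waiter.count;
            waiter.resolve();
        }
    }

    // Surround an async function with acquire/release, as the upload flow does.
    async surround(count, func) {
        await this.acquire(count);
        try {
            return await func();
        } finally {
            this.release(count);
        }
    }
}

// If func() buffers more than the acquired estimate, or a permit is never
// released, later uploads queue up behind the semaphore and eventually stall.
const sem = new CountingSemaphore(64 * 1024 * 1024);
sem.surround(UPLOAD_BUDGET_BYTES, async () => {
    /* stream the object body to the filesystem here */
}).catch(console.error);
```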
@guymguym Sure, I'll try to create a stress test.
@romayalon If I understand this hashCheck=true mode correctly, it has no effect on the endpoint behavior; it only changes how the cosbench workers behave, right? It might have a side effect on scheduling due to the workers having more CPU work between requests. But it is an interesting effect to analyze.
@romayalon There are 2. One is from run 967, which had NSFS debug level. The xml that I posted in a thread above, where I give info about runs 967 and 969, is the shortest and best one. I'll copy it from above and put it here so that you don't have to fish for it.
@guymguym My testing is customer-level testing. I am not allowed to modify anything other than what is provided by our command line interface. In other words, I cannot modify memory pool buffers by editing deployments/pods. I have to use the default CPU and memory. I hope I answered all of your questions.
Got you @MonicaLemay. Just to clarify, I didn't mean to use that as a valid workaround or anything funky like that, only as a debugging technique while we are still chasing the root cause of this issue. But I understand the concern with making these out-of-band changes to a long-running system and whether undoing them will actually reset the configuration back correctly. I'll try to come up with more ideas on how to narrow down the problem from your side without making configuration changes, while we try to reproduce it on a dev environment. Thanks!
@MonicaLemay A couple more questions:
@guymguym
And after a while read timeouts started happening:
And for s5002b2/s5002o3_rGB:
And for s5002b5/s5002o9_rGB: (this time the error code is SlowDown and not EINVAL, as we had in the previous bug)
@romayalon The first few are GET errors - perhaps I missed it and
OK so I see in cosbench code that hashCheck means:
So regarding the error you found - this seems weird indeed - I would try to look at this file from the FS in a hex dump (e.g. with xxd).
I wonder why this "no checksum embedded" log was written with "[NoneStorage]" and not "[S3Storage]" like the rest... |
@guymguym yes, so according to the xml and the logs:
@MonicaLemay do you see the template Guy mentioned, [NoneStorage]? Most of the log messages are written with this [NoneStorage] in the log.
When files are 3-4 GB in size, no files get completely written:
The cosbench log:
When files are 100-200 MB in size, the condition does not occur. This defect does not happen.
@MonicaLemay Are you running any special branch/fork of cosbench? or a release? or latest commit from its master branch? |
@MonicaLemay @romayalon Hmmmmmmmmm, this RandomInputStream code where you see the ArrayIndexOutOfBoundsException error definitely looks to me like the root cause of this entire thing. It mixes 32-bit signed integers with 64-bit integers in calculations, which is well known to be prone to bugs around the 2 GB offset (the maximum positive 32-bit signed value). I'll try to find the exact problem, but I sense that both cases are caused by this code.
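To make the 2 GB boundary concrete, here is a small Node.js illustration (this is not cosbench code; it just emulates Java's 32-bit signed int arithmetic with `| 0`):

```js
// Emulating 32-bit signed arithmetic to show why offset math breaks past 2 GB.
const TWO_GB = 2 * 1024 * 1024 * 1024;   // 2147483648 as a plain JS number
const asInt32 = n => n | 0;              // truncate to a signed 32-bit value

console.log(asInt32(TWO_GB - 1));        // 2147483647  -> still fine
console.log(asInt32(TWO_GB));            // -2147483648 -> wraps negative

// A negative "offset" like this, used to size or index a buffer, produces an
// out-of-range index -- consistent with the ArrayIndexOutOfBoundsException
// reported from cosbench's RandomInputStream.
```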
Ok I think I found it, and opened an issue and a PR to fix it on cosbench. |
In order to follow up on the hex from the FS, I had to do another run. I had deleted the data from previous runs to do the requested MB and > 3GB tests. New run:
From the cosbench log
From the FS
I'm not sure what to look at with the xxd. I am looking at file s5003o2_rGB in bucket s5003b2. The last few lines from xxd are:
@MonicaLemay Thanks! So what we can see in the hex dump is that it ends with a
This means that in this case the write itself actually finished fine, which makes sense because the size is 2,000,000,000 bytes - about 140 MB less than 2 GB - so it shouldn't have caused an int overflow. However, when I look at the Reader validateChecksum() code here, I get a strong feeling that I see another bug, since the code ignores the bytes in buf2.
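For context, a hedged Node.js sketch of what a trailing-checksum validation has to do (this is not the cosbench Java code, and the raw 16-byte trailing MD5 layout is an assumption): every payload byte, including a partially filled last buffer, must be fed into the hash before comparing against the embedded digest. Ignoring the bytes in a second buffer would corrupt the comparison even for objects that were written correctly.

```js
const crypto = require('crypto');
const fs = require('fs');

// Sketch: object layout assumed to be <payload bytes><16-byte MD5 of payload>.
async function validateEmbeddedMd5(path) {
    const { size } = await fs.promises.stat(path);
    if (size <= 16) return false;                 // nothing but (at most) a digest
    const payloadEnd = size - 16;                 // the digest lives in the tail
    const hash = crypto.createHash('md5');

    // Hash every payload byte; no chunk (full or partial) may be skipped.
    const stream = fs.createReadStream(path, { end: payloadEnd - 1 });
    for await (const chunk of stream) hash.update(chunk);

    // Read the embedded digest from the end of the file and compare.
    const expected = Buffer.alloc(16);
    const fd = await fs.promises.open(path, 'r');
    await fd.read(expected, 0, 16, payloadEnd);
    await fd.close();

    return hash.digest().equals(expected);
}

validateEmbeddedMd5('/path/to/object').then(ok => console.log('checksum ok:', ok));
```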
@guymguym Awesome! So you think that these failures cause retries and that's what creates more stress?
@guymguym @romayalon
Hey so I think it's still hard for me to determine.
What I wanted to suggest was to try to build cosbench with the fix in intel-cloud/cosbench#426 to see if this eliminates any of those cases. The initial approach we discussed here was to add custom instrumentation to the noobaa endpoints code to be able to pinpoint bottlenecks that could cause those timeouts. Both directions would make positive progress - WDYT? Any preference?
@guymguym I looked at the instructions to build cosbench in https://github.com/intel-cloud/cosbench/blob/master/BUILD.md. Unfortunately, our team may decide not to invest in cosbench as a system test tool, and we will not be building it. Should this defect proceed without the cosbench fix, or should we now close it if you think everything we are seeing is a cosbench problem?
@MonicaLemay Thanks. I do believe that the fact that you had to restart the endpoints is still an issue here. The fact that we observed this no-service issue only with a (potentially buggy) cosbench with hashCheck=true means there is a pathological case in the noobaa code. Perhaps we still have an issue with promises and the semaphore, as we had in #6848. @nimrod-becker I would rename this issue title to reflect that this is about cosbench + hashCheck = manual restart, and I would keep it open because we don't know why we had to restart the endpoints. @romayalon I would look for the semaphore messages (like the semaphore warning stuck and the buffer pool warning stuck) in the endpoint logs and see if there's anything reported.
Hi @MonicaLemay, are you blocked on your story by this fix? Or can you still work on your story with MD5 check off?
@romayalon I am able to make progress on my user story; I can use warp. I reported this as a blocker because this defect is possibly a blocker to the MVP release - I think that part of the message got lost in the hand-off of status. If you do not agree that this is a blocker to the MVP, please let me know. I made that assessment because of the above statement:
Please let me know what you think.
@MonicaLemay I didn't say it's a blocker / not a blocker for the MVP. |
@romayalon Thanks for the clarification. I have 2 more stories to test that do not involve hashCheck=true. I can continue. |
@romayalon The test results show that removing hashCheck=true from the read stage does NOT cause the outage (the inability to cp a file). Here is the xml file.
@romayalon are you planning to have this in the 4.9.6 code?
@rkomandu first I'll build these changes on the 4.9 branch so we can see that it solves Monica's issue, and then we can discuss whether it'll get into 4.9.6.
@rkomandu @MonicaLemay |
@MonicaLemay an updated image named noobaa/noobaa-core:moki-20220510 is now available. |
@MonicaLemay The most updated image named noobaa/noobaa-core:moki-20220511 is now available. |
I tested the patch today and it works. I no longer see the issue. Thank you. |
@MonicaLemay is there something else we should do here, or can we close?
As far as I am concerned it can be closed. I don't know the Noobaa team's procedures. Some teams leave defects open until a fix or patch is in an official build. If this team does not do this, then it can be closed. |
According to the last comments, will close.
Environment info
[root@c83f1-app1 ~]# noobaa status
INFO[0000] CLI version: 5.9.2
INFO[0000] noobaa-image: noobaa/noobaa-core:nsfs_backport_5.9-20220331
INFO[0000] operator-image: quay.io/rhceph-dev/odf4-mcg-rhel8-operator@sha256:01a31a47a43f01c333981056526317dfec70d1072dbd335c8386e0b3f63ef052
INFO[0000] noobaa-db-image: quay.io/rhceph-dev/rhel8-postgresql-12@sha256:98990a28bec6aa05b70411ea5bd9c332939aea02d9d61eedf7422a32cfa0be54
[root@c83f1-app1 ~]# oc get csv
NAME DISPLAY VERSION REPLACES PHASE
mcg-operator.v4.9.5 NooBaa Operator 4.9.5 mcg-operator.v4.9.4 Succeeded
ocs-operator.v4.9.5 OpenShift Container Storage 4.9.5 ocs-operator.v4.9.4 Succeeded
odf-operator.v4.9.5 OpenShift Data Foundation 4.9.5 odf-operator.v4.9.4 Succeeded
Actual behavior
This is not the same as issue 6930.
In this issue that I am opening, it is true that the node remained in the Ready state, so I don't expect any IP failover. This defect is not about MetalLB IPs not failing over. In this defect, I/O was running to MetalLB IP 172.20.100.31, which is for node master1. On node master0, in the CNSA Scale core pod (namespace ibm-spectrum-scale), mmshutdown was issued for just that node. The other nodes remained active with the filesystem mounted. Master0 has MetalLB IP 172.20.100.30, and there was no I/O going to that IP.
What was observed after mmshutdown on master0 was that all I/O going to 172.20.100.31 stopped. Because of issue 6930, there was no failover. That is fine and expected. But what is not expected is for all I/O to stop.
When mmshutdown was issued, the noobaa endpoint pods' only error was Stale file handle.
This error is a bit odd because it is on the endpoint pod for master0. Master0's MetalLB IP was 172.20.100.30, and the cosbench workload was only set up for 172.20.100.31.
An additional observation is that the s3 list command works but writes do not.
All subsequent PUTs to 172.20.100.31 and 172.20.100.32 get a timeout (if I don't Ctrl-C) and the endpoint pods record an "Error: Semaphore Timeout".
From .31 and .32 we can do GETs and we can read from the NooBaa database. If we rsh into the endpoint pods for the IPs 172.20.100.31 and 172.20.100.32, we see that Spectrum Scale is still mounted in the correct place and we can write to it manually with touch. So this tells us that the .31 and .32 IPs are still alive and that the noobaa db is still online. It also tells us that the Spectrum Scale filesystem is still mounted and writable. The timeouts on the subsequent PUTs tell us that the client makes a connection request but never gets a response.
The endpoint pods never restarted and they still have their labels.
Also, in the Scale core pod we run mmhealth node show -N all and see that everything is HEALTHY, except of course the one node on which we did mmshutdown.
Something is obviously hung in the PUT path, but the logs and noobaa health don't point to anything.
When we issue mmstartup, the PUTs still fail. The only way to recover is to delete the noobaa endpoint pods and have new ones generated.
I have been able to recreate this very easily, so if it is required I can set this up on my test stand.
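To illustrate one hypothetical mechanism that would be consistent with these symptoms (PUTs timing out with "Error: Semaphore Timeout" while GETs and lists keep working, and only a pod restart recovering): if uploads must acquire a permit from a shared semaphore and a permit is never released, every later acquire waits until it times out, while paths that bypass the semaphore stay healthy. This is only a sketch, not the actual noobaa code.

```js
// Hypothetical illustration only: a leaked permit starves all later uploads.
class Semaphore {
    constructor(permits) { this.permits = permits; this.queue = []; }
    acquire(timeoutMs) {
        if (this.permits > 0) { this.permits--; return Promise.resolve(); }
        return new Promise((resolve, reject) => {
            const timer = setTimeout(() => reject(new Error('Semaphore Timeout')), timeoutMs);
            this.queue.push(() => { clearTimeout(timer); resolve(); });
        });
    }
    release() {
        const next = this.queue.shift();
        if (next) next(); else this.permits++;
    }
}

const sem = new Semaphore(1);

async function put(name, { leak = false } = {}) {
    await sem.acquire(2000);                 // PUTs need a permit
    try { console.log('PUT ok:', name); }
    finally { if (!leak) sem.release(); }    // a leaked permit is never returned
}

const get = name => console.log('GET ok:', name); // GET/list do not need a permit

(async () => {
    await put('obj1', { leak: true });                                       // permit leaks here
    get('obj1');                                                             // still works
    await put('obj2').catch(err => console.log('PUT failed:', err.message)); // times out
})();
```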
Expected behavior
1. Doing mmshutdown on one node should not impact cluster-wide I/O capability; it should not be an outage. If an outage is indeed expected, then mmstartup should recover I/O capability.
Steps to reproduce
More information - Screenshots / Logs / Other output
Must gather and noobaa diagnose in https://ibm.ent.box.com/folder/145794528783?s=uueh7fp424vxs2bt4ndrnvh7uusgu6tocd
This issue started as HPO https://github.ibm.com/IBMSpectrumScale/hpo-core/issues/659. Screeners determined that it was with NooBaa. I have also Slacked the CNSA team for input but have not heard back.