Test_openjdk8_j9_sanity.system_x86-64_linux_xl LambdaLoadTest_ConcurrentScavenge_0 OOM #6475
@amicic, can somebody take a look at this please?
I was eventually able to reproduce it in a local environment. It's due to excessive GC. Will investigate further.
Reproducibility is about 1/20. It does not seem to be Concurrent Scavenger related; I could reproduce this even with the STW Scavenger, and in fact with the Optthruput GC policy. It will probably occur only in Java 8 and only with the XL JVM, because the heap is set to 512M and there is more stress with the XL JVM. This is not even a recent regression; it would occur in Release 0.11, too.

I suspect that the test itself is broken. Normally, it starts 6 concurrent test threads and they finish within 1-2 seconds, with very low memory (heap) requirements. However, occasionally one of the threads would not terminate, but would keep allocating and, more importantly, retaining a relatively large amount of objects (200-300MB). If only one such thread 'misbehaves', the test would time out, but still PASS! However, if 2 such threads misbehave, the live set would approximately double, the heap (512M) would fill up, there would be constant GCing, and OOM would be thrown due to excessive GC.

Here is the output when 2 of 6 threads would not complete as normal:

LT 08:41:45.813 - Starting thread. Suite=0 thread=0

I tried running with Hotspot, but for some reason I was not able to.
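To make the failure mode concrete, here is a minimal hypothetical sketch (not the actual test code) of what a 'misbehaving' thread effectively does; two such workers retaining ~250MB each nearly fill a 512M heap, leaving the collector running back-to-back until the excessive-GC OOM fires:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; the real LambdaLoadTest workers differ.
public class RetainingWorker implements Runnable {
    // Holding the references is the key: the GC cannot reclaim them.
    private final List<byte[]> retained = new ArrayList<>();

    @Override
    public void run() {
        // Allocate and retain ~250MB instead of finishing in 1-2 seconds.
        for (int i = 0; i < 250; i++) {
            retained.add(new byte[1024 * 1024]); // 1MB per chunk
        }
        // ...keeps running rather than terminating, as described above.
    }
}
```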
@Mesbah-Alam can you please help Aleks run the test with Hotspot.
@amicic, what GC policy should be used while running the test on HotSpot? Or do you predict that the same issue will be reproduced on HotSpot irrespective of the GC policy used?

This test was tagged to run on OpenJ9 only: https://github.com/AdoptOpenJDK/openjdk-tests/blob/865ba275482fe6076a51730ad1451b16ea5f3314/systemtest/lambdaLoadTest/playlist.xml#L116

It uses the following GC options:
GC policy is irrelevant for either OpenJ9 or Hotspot; defaults are OK. If we see messages like the following (which occur approximately 1 in 4 runs for OpenJ9) with HotSpot, that would indicate that the test is somewhat broken/inconsistent:

LT 08:42:06.503 - Completed 3.0%. Number of tests started=6

Once it occurs, and lasts for some time (tens of seconds), in OpenJ9 it may lead to OOM (due to excessive GC). Not necessarily for HotSpot, since it may have different criteria for 'excessive GC'. Even with these messages, OpenJ9 would not always end in OOM. So, looking at just the PASS/FAIL of the test, we cannot determine whether the test has a problem with HotSpot, too; we have to look for occurrences of the above log message.
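For context, each VM applies its own heuristic for declaring 'excessive GC', which is why the same workload can OOM on one VM and pass on the other. A sketch of the relevant knobs; the option names below are quoted from memory and should be verified against each VM's docs:

```
# OpenJ9: percentage of time spent in GC that triggers the excessive-GC OOM
java -Xmx512m -Xgc:excessiveGCratio=95 ...

# HotSpot: the analogous "GC overhead limit exceeded" heuristic can be disabled
java -Xmx512m -XX:-UseGCOverheadLimit ...
```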
@amicic - thanks for the clarification.
There is no HotSpot Large Heap Linux x64 JDK8 at Adopt: https://adoptopenjdk.net/nightly.html?variant=openjdk8&jvmVariant=hotspot

Would testing with the regular HotSpot Linux x64 JDK8 suffice?
It does not have to be the Large Heap variant (which I believe can be enabled by an option, from the same VM), although the Large Heap variant is more likely to cause OOM (for OpenJ9, anyway). But again, since we are not looking for OOM, but for the log messages, XL or not is irrelevant.
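For reference, on OpenJ9 builds that include both modes, the large-heap (non-compressed-references) configuration can be selected from the same binary with a flag, e.g.:

```
# Run the regular binary in large-heap (xl-style) mode:
java -Xnocompressedrefs -Xmx512m ...
```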
Hi @amicic,

A 20x Grinder on HotSpot does not reproduce the issue: https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder-systemtest/179/tapResults/

A 40x Grinder on HotSpot also does not reproduce the issue: https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder-systemtest/180/tapResults/

Ran with JDK 8, Linux x64 (regular heap size) in "NoOptions" mode. Is there any other option we could try to reproduce the issue?
Indeed, none of these 60 runs had that intermediate 'Completed...' message. One difference was that the OpenJ9 runs were with 6 threads and these were with only 2. Could you repeat it with 6 threads?
For clarification, I'm again showing the problematic output, but with the test actually passing (occurs 1 in 4 runs with OpenJ9):
This is normal output:
Besides the extra message, one can see that it took an extra 25 sec for the 'bad' test (for one of six threads) to complete, which normally takes only about 1 sec.
Hi @amicic, Here's how the test sets the number of threads it will use:
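A hedged sketch of that logic (helper and variable names are approximate, not copied from LambdaLoadTest.java):

```java
// Approximate sketch of the thread-count calculation, not the exact source:
int cpuCount = Runtime.getRuntime().availableProcessors();
// Use cpuCount - 2 worker threads with a floor of 2: an 8-way machine
// gets 6 threads, while a 4-way grinder machine gets only 2.
int threadCount = Math.max(cpuCount - 2, 2);
```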
Where cpuCount = Runtime.getRuntime().availableProcessors().

Could you please share the name of the machine on which you were able to reproduce the error? Maybe I should use the same machine for the HotSpot runs too.
I was using my personal cloud (Fyre) instance, which had 8 threads.
Running the test on an internal machine with 8 CPUs produces results where we do see similar unusual behaviour, with 1 or 2 threads taking an extra 20 sec.
Here's the main class that drives the "lambda" load: https://github.com/AdoptOpenJDK/openjdk-systemtest/blob/master/openjdk.test.load/src/test.load/net/adoptopenjdk/stf/LambdaLoadTest.java

Lists of tests that are run in the load are here: https://github.com/AdoptOpenJDK/openjdk-systemtest/blob/master/openjdk.test.load/config/inventories/lambdasAndStreams/lambda.xml
Copying @amicic's comments from Slack:
Hi @pshipton, could you please share your comments on whether or not it is worthwhile pursuing this investigation further? (Or should we close this issue by setting -Xmx1G to avoid the OOMs?)
@Mesbah-Alam it seems worth more investigation in order to have a robust test. Were you able to reproduce the problem with Hotspot using 8 cpus? Your comment doesn't say.

Hard-coding a bigger heap size doesn't seem like a fix, but a hack which may not always work. That said, it does make sense that more threads can run more concurrent testing, which will have a bigger memory requirement. Perhaps the test should limit the number of threads created.

Can you get a core file for the OOM failure, and look at it with Eclipse MAT to see what is consuming all the heap space?
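If it helps, one way to force a system dump at the OOM point on OpenJ9 (which normally writes dumps on OOM by default anyway; this just makes it explicit) is the -Xdump trigger below; the resulting .dmp can then be opened in Eclipse MAT via its DTFJ adapter:

```
# Write a system dump on the first OutOfMemoryError thrown
java -Xmx512m \
     -Xdump:system:events=systhrow,filter=java/lang/OutOfMemoryError,range=1..1 \
     <usual load test arguments>
```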
And btw, when testing with Hotspot to see if the issue can be reproduced, -Xmx512m should be used.
Another |
A 5x grinder on ppc64le_linux_xl failed 3/5 |
https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_sanity.system_ppc64le_linux_xl_Nightly/33
https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_sanity.system_ppc64le_linux_xl_Nightly/57

@amicic @Mesbah-Alam What is the plan to fix this test? Should we exclude it until it can run properly?
https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_sanity.system_ppc64le_linux_xl_OMR/9

@Mesbah-Alam can you please exclude this test, only for large heap builds if possible.
Looks like this issue is not GC but test related. Changing label to
These OOMs should no longer occur as a result of the changes made for adoptium/aqa-systemtest#379
The test no longer seems to be excluded (the exclude was removed by adoptium/aqa-tests#2229), and I see LambdaLoadTest_CS_5m running in the builds.
https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_sanity.system_x86-64_linux_xl_Nightly/98