
Test_openjdk8_j9_sanity.system_x86-64_linux_xl LambdaLoadTest_ConcurrentScavenge_0 OOM #6475

Closed
pshipton opened this issue Jul 16, 2019 · 32 comments


@pshipton (Member)

https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_sanity.system_x86-64_linux_xl_Nightly/98

LT  06:46:01.051 - Completed 3.0%. Number of tests started=6 (+0)
LT  stderr JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2019/07/16 06:46:13 - please wait.
LT  stderr JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2019/07/16 06:46:13 - please wait.
@pshipton (Member, Author)

1TISIGINFO Dump Event "systhrow" (00040000) Detail "java/lang/OutOfMemoryError" "Java heap space" received

@amicic can somebody take a look at this please?

@amicic (Contributor)

amicic commented Jul 16, 2019

I was eventually able to reproduce this in a local environment. The OOM is due to excessive GC. Will investigate further.

@amicic (Contributor)

amicic commented Jul 22, 2019

Reproducibility is about 1/20.

It does not seem to be Concurrent Scavenger related. I could reproduce this even with the STW Scavenger, and in fact with the optthruput GC policy.

It will probably occur only in Java 8 and only with the XL JVM, because the heap is set to 512M and there is more stress with the XL JVM.

This is not even a recent regression; it would occur in Release 0.11, too.

I suspect that the test itself is broken. Normally it starts 6 concurrent test threads and they finish within 1-2 seconds, with very low memory (heap) requirements. However, occasionally one of the threads does not terminate, but keeps allocating and, more importantly, retaining a relatively large amount of objects (200-300MB). If only one such thread 'misbehaves', the test times out but still PASSES! However, if 2 such threads misbehave, the live set approximately doubles, the heap (512M) fills up, there is constant GCing, and OOM is thrown due to excessive GC.
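As a rough illustration of why two misbehaving threads are fatal while one is not (the 200-300MB retained-set figure is the estimate above; the 250MB midpoint and the class name are my assumptions):

```java
// Back-of-the-envelope sketch of the heap pressure described above.
// Figures are the rough estimates from this comment, not measured values.
public class HeapPressureSketch {
    public static void main(String[] args) {
        final int heapMb = 512;               // -Xmx for these XL JVM runs
        final int retainedPerBadThread = 250; // midpoint of the 200-300MB estimate

        for (int badThreads = 1; badThreads <= 2; badThreads++) {
            int liveSetMb = badThreads * retainedPerBadThread;
            System.out.printf("%d misbehaving thread(s): ~%dMB live in a %dMB heap%n",
                    badThreads, liveSetMb, heapMb);
        }
        // With one bad thread (~250MB) the heap copes; with two (~500MB) nearly
        // the entire 512MB heap is live, so the GC thrashes and OOM follows.
    }
}
```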

Here is the output when 2 of the 6 threads did not complete as normal:

LT 08:41:45.813 - Starting thread. Suite=0 thread=0
LT 08:41:45.814 - Starting thread. Suite=0 thread=1
LT 08:41:45.821 - Starting thread. Suite=0 thread=2
LT 08:41:45.822 - Starting thread. Suite=0 thread=3
LT 08:41:45.822 - Starting thread. Suite=0 thread=4
LT 08:41:45.823 - Starting thread. Suite=0 thread=5
LT 08:41:46.075 - Thread completed. Suite=0 thread=0
LT 08:41:46.077 - Thread completed. Suite=0 thread=4
LT 08:41:46.077 - Thread completed. Suite=0 thread=5
LT 08:41:46.133 - Thread completed. Suite=0 thread=3
LT 08:42:06.503 - Completed 3.0%. Number of tests started=6
LT stderr JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2019/07/22 08:42:26 - please wait.

I tried running with Hotspot, but for some reason I was not able to.
I think the test team should look at this next and try to reproduce with Hotspot. If this thread misbehaviour occurs there too, the test is broken; otherwise the VM team needs to investigate further.

@pshipton (Member, Author)

@Mesbah-Alam can you please help Aleks run the test with Hotspot.
@llxia @smlambert fyi

@Mesbah-Alam (Contributor)

Mesbah-Alam commented Jul 24, 2019

@amicic, what GC policy should be used while running the test on HotSpot? Or do you predict that the same issue will reproduce on HotSpot irrespective of the GC policy used?

This test is tagged to run on OpenJ9 only: https://github.com/AdoptOpenJDK/openjdk-tests/blob/865ba275482fe6076a51730ad1451b16ea5f3314/systemtest/lambdaLoadTest/playlist.xml#L116

It uses the following GC options: -Xgcpolicy:gencon -Xgc:concurrentScavenge -- which I believe will not work on HotSpot.

@amicic (Contributor)

amicic commented Jul 24, 2019

GC policy is irrelevant for either OpenJ9 or HotSpot; the defaults are OK.

If we see messages like the following (which occur in approximately 1 in 4 runs for OpenJ9)

LT 08:42:06.503 - Completed 3.0%. Number of tests started=6

with HotSpot, that would indicate that the test is somewhat broken/inconsistent.

Once it occurs, and lasts for some time (tens of seconds), in OpenJ9 it may lead to OOM (due to excessive GC). Not necessarily for HotSpot, since they may have different criteria for 'excessive GC'. Even in OpenJ9, these messages do not always lead to OOM.

So, looking at just the PASS/FAIL result of the test, we cannot determine whether the test has a problem with HotSpot too. We have to look for occurrences of the above log message.

@Mesbah-Alam (Contributor)

@amicic - thanks for the clarification.

> It will probably occur only in Java 8 and only with XL JVM, because heap is set to 512M and there is more stress with XL JVM

There is no HotSpot Large Heap Linux x64 JDK8 at Adopt : https://adoptopenjdk.net/nightly.html?variant=openjdk8&jvmVariant=hotspot

Would testing with regular HotSpot Linux x64 JDK8 suffice?

@amicic (Contributor)

amicic commented Jul 24, 2019

It does not have to be the Large Heap variant (which I believe could be enabled by an option, from the same VM), although the Large Heap variant is more likely to cause OOM (for OpenJ9, anyway). But again, since we are not looking for the OOM but for the log messages, XL or not is irrelevant.

@Mesbah-Alam (Contributor)

Mesbah-Alam commented Jul 24, 2019

Hi @amicic,

20x Grinder on Hotspot does not reproduce the issue: https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder-systemtest/179/tapResults/

40x Grinder on HotSpot also does not reproduce the issue: https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder-systemtest/180/tapResults/

Ran with JDK 8, Linux x64 (regular heap size) in "NoOptions" mode. Is there any other option we could try to reproduce the issue?

@amicic (Contributor)

amicic commented Jul 25, 2019

Indeed, none of these 60 runs had that intermediate 'Completed...' message.

One difference was that the OpenJ9 runs used 6 threads and these used only 2.

Could you repeat it with 6 threads?

@amicic (Contributor)

amicic commented Jul 25, 2019

For clarification, I'm again showing the problematic output, but with the test actually passing (occurs 1 in 4 runs with OpenJ9):

15:05:44.899 - Starting thread. Suite=0 thread=0
15:05:44.900 - Starting thread. Suite=0 thread=1
15:05:44.900 - Starting thread. Suite=0 thread=2
15:05:44.900 - Starting thread. Suite=0 thread=3
15:05:44.901 - Starting thread. Suite=0 thread=4
15:05:44.901 - Starting thread. Suite=0 thread=5
15:05:45.055 - Thread completed. Suite=0 thread=5
15:05:45.074 - Thread completed. Suite=0 thread=2
15:05:45.075 - Thread completed. Suite=0 thread=0
15:05:45.117 - Thread completed. Suite=0 thread=4
15:05:45.141 - Thread completed. Suite=0 thread=3
15:06:04.966 - Completed 3.0%. Number of tests started=6
15:06:10.599 - Thread completed. Suite=0 thread=1
15:06:10.626 - Test stopped due to reaching runtime limit
15:06:10.626 - Load test completed
15:06:10.626 -   Ran     : 6
15:06:10.626 -   Passed  : 6
15:06:10.626 -   Failed  : 0
15:06:10.627 -   Result  : PASSED

This is normal output:

09:16:26.857 - Starting thread. Suite=0 thread=0
09:16:26.858 - Starting thread. Suite=0 thread=1
09:16:26.858 - Starting thread. Suite=0 thread=2
09:16:26.858 - Starting thread. Suite=0 thread=3
09:16:26.859 - Starting thread. Suite=0 thread=4
09:16:26.859 - Starting thread. Suite=0 thread=5
09:16:27.114 - Thread completed. Suite=0 thread=2
09:16:27.114 - Thread completed. Suite=0 thread=1
09:16:27.146 - Thread completed. Suite=0 thread=5
09:16:27.260 - Thread completed. Suite=0 thread=3
09:16:27.370 - Thread completed. Suite=0 thread=0
09:16:27.416 - Thread completed. Suite=0 thread=4
09:16:27.460 - Test stopped due to reaching runtime limit
09:16:27.461 - Load test completed
09:16:27.461 -   Ran     : 6
09:16:27.461 -   Passed  : 6
09:16:27.461 -   Failed  : 0
09:16:27.461 -   Result  : PASSED

Besides the extra message, one can see that the 'bad' test (on one of the six threads) took an extra 25 seconds to complete, when it normally takes only about 1 second.

@Mesbah-Alam (Contributor)

> One difference was that OpenJ9 runs were with 6 threads and these were with only 2.

Hi @amicic,

Here's how the test sets the number of threads it will use:

.setSuiteThreadCount(cpuCount - 2, 2) // Leave 1 cpu for the JIT, 1 cpu for GC and set min 2

Where, cpuCount = Runtime.getRuntime().availableProcessors()
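A minimal sketch of that calculation (the class and method names here are hypothetical wrappers; only the formula comes from the test source quoted above):

```java
// Hypothetical sketch of the thread-count formula quoted above:
// reserve one CPU for the JIT and one for GC, with a floor of 2 threads.
public class ThreadCountSketch {
    static int suiteThreadCount(int cpuCount) {
        return Math.max(cpuCount - 2, 2);
    }

    public static void main(String[] args) {
        System.out.println(suiteThreadCount(8)); // prints 6
        System.out.println(suiteThreadCount(4)); // prints 2 (the minimum)
    }
}
```

This would explain the earlier discrepancy: an 8-CPU machine runs 6 test threads, while a machine with 4 or fewer CPUs hits the floor of 2.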

Could you please share the name of the machine on which you were able to reproduce the error? Maybe I should use the same machine for the HotSpot runs too.

@amicic (Contributor)

amicic commented Jul 25, 2019

I was using my personal cloud (Fyre) instance, which had 8 threads.

@Mesbah-Alam (Contributor)

Running the test on an internal machine with 8 CPUs produced results where we do see similar unusual behaviour, with 1 or 2 threads taking an extra 20 seconds:

14:03:49  LT  11:03:47.136 - Thread completed. Suite=0 thread=0
14:03:49  LT  11:03:47.155 - Thread completed. Suite=0 thread=5
14:03:49  LT  11:03:47.312 - Thread completed. Suite=0 thread=2
14:03:49  LT  11:03:47.452 - Thread completed. Suite=0 thread=4
14:04:14  LT  11:04:07.202 - Completed 3.0%. Number of tests started=6
14:04:18  LT  11:04:17.407 - Thread completed. Suite=0 thread=3
14:04:18  LT  11:04:17.458 - Thread completed. Suite=0 thread=1
14:04:18  LT  11:04:17.496 - Test stopped due to reaching runtime limit
14:04:18  LT  11:04:17.497 - Load test completed

@Mesbah-Alam (Contributor)

Mesbah-Alam commented Jul 29, 2019

Copying @amicic's comments from Slack:

As for the original failure, a quick workaround is to provide -Xmx1G to avoid the OOMs.

The inconsistency could possibly be explained by .setSuiteRandomSelection(); // Randomly pick the next test each time

Perhaps not all tests are run every time, and some genuinely need more time and have a larger live set (more heap required); although with 15 or so tests in the list and 200 test runs (as per setSuiteNumTests(200)), likely all are run all the time.
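That last point is easy to sanity-check. Assuming ~15 tests and 200 uniform random selections (both figures from the comment above), the chance of any test being skipped entirely is negligible:

```java
// Probability that a given test is never selected in 200 uniform random
// picks from an inventory of 15 tests: (14/15)^200.
public class SelectionOdds {
    public static void main(String[] args) {
        double pMissOne = Math.pow(14.0 / 15.0, 200);
        double pMissAny = 15 * pMissOne; // union bound over the 15 tests
        System.out.printf("P(a given test never runs)  = %.1e%n", pMissOne); // ~1.0e-06
        System.out.printf("P(some test never runs)    <= %.1e%n", pMissAny); // ~1.5e-05
    }
}
```

So with setSuiteNumTests(200), every test in the inventory is almost certainly exercised in every run, which supports the "likely all are run all the time" conclusion.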

The fact that we don't see a problem with 2 test threads but we do with 6 is also probably relevant. It somewhat suggests a race condition in the test (races are generally more likely with more threads).

@Mesbah-Alam (Contributor)

Hi @pshipton, could you please share your thoughts on whether it is worthwhile pursuing this investigation further? (Or should we close this issue by setting -Xmx1G to avoid the OOMs?)

@pshipton (Member, Author)

@Mesbah-Alam it seems worth more investigation in order to have a robust test. Were you able to reproduce the problem with HotSpot using 8 CPUs? Your comment doesn't say.

Hard-coding a bigger heap size doesn't seem like a fix, but a hack which may not always work. That said, it does make sense that more threads run more concurrent tests, which has a bigger memory requirement. Perhaps the test should limit the number of threads it creates.

Can you get a core file for the OOM failure, and look at it with Eclipse MAT to see what is consuming all the heap space?

@pshipton (Member, Author)

And btw, when testing with HotSpot to see whether the issue can be reproduced, -Xmx512m should be used.

@pshipton (Member, Author)

pshipton commented Dec 6, 2019

A 5x grinder on ppc64le_linux_xl failed 3/5
https://ci.eclipse.org/openj9/job/Grinder/563/

@JasonFengJ9 (Member)

https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_sanity.system_ppc64le_linux_xl_Nightly/18/tapResults/

LT  23:36:11.072 - First failure detected by thread: load-5. Not creating dumps as no dump generation is requested for this load test
LT  23:36:09.041 - Uncaught exception: 
LT  java.lang.OutOfMemoryError: Java heap space
LT  	at java.lang.String.<init>(String.java:820) ~[?:1.8.0_242-internal]
LT  	at java.lang.String.substring(String.java:2648) ~[?:1.8.0_242-internal]
LT  	at sun.util.resources.LocaleData$LocaleDataResourceBundleControl.toBundleName(LocaleData.java:273) ~[?:1.8.0_242-internal]
LT  	at java.util.ResourceBundle$Control.newBundle(ResourceBundle.java:2651) ~[?:1.8.0_242-internal]
LT  	at java.util.ResourceBundle.loadBundle(ResourceBundle.java:1518) ~[?:1.8.0_242-internal]
LT  	at java.util.ResourceBundle.findBundle(ResourceBundle.java:1482) ~[?:1.8.0_242-internal]
LT  	at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1370) ~[?:1.8.0_242-internal]
LT  	at java.util.ResourceBundle.getBundle(ResourceBundle.java:899) ~[?:1.8.0_242-internal]
LT  	at sun.util.resources.LocaleData$1.run(LocaleData.java:167) ~[?:1.8.0_242-internal]
LT  	at sun.util.resources.LocaleData$1.run(LocaleData.java:163) ~[?:1.8.0_242-internal]
LT  	at java.security.AccessController.doPrivileged(AccessController.java:678) ~[?:1.8.0_242-internal]
LT  	at sun.util.resources.LocaleData.getBundle(LocaleData.java:163) ~[?:1.8.0_242-internal]
LT  	at sun.util.resources.LocaleData.getNumberFormatData(LocaleData.java:159) ~[?:1.8.0_242-internal]
LT  	at sun.util.locale.provider.LocaleResources.getDecimalFormatSymbolsData(LocaleResources.java:168) ~[?:1.8.0_242-internal]
LT  	at java.text.DecimalFormatSymbols.initialize(DecimalFormatSymbols.java:616) ~[?:1.8.0_242-internal]
LT  	at java.text.DecimalFormatSymbols.<init>(DecimalFormatSymbols.java:113) ~[?:1.8.0_242-internal]
LT  	at sun.util.locale.provider.DecimalFormatSymbolsProviderImpl.getInstance(DecimalFormatSymbolsProviderImpl.java:85) ~[?:1.8.0_242-internal]
LT  	at java.text.DecimalFormatSymbols.getInstance(DecimalFormatSymbols.java:180) ~[?:1.8.0_242-internal]
LT  	at java.util.Formatter.getZero(Formatter.java:2283) ~[?:1.8.0_242-internal]
LT  	at java.util.Formatter.<init>(Formatter.java:1892) ~[?:1.8.0_242-internal]
LT  	at java.util.Formatter.<init>(Formatter.java:1914) ~[?:1.8.0_242-internal]
LT  	at java.lang.String.format(String.java:4174) ~[?:1.8.0_242-internal]
LT  	at net.adoptopenjdk.loadTest.LoadTestRunner.run(LoadTestRunner.java:311) ~[bin/:?]
LT  	at net.adoptopenjdk.loadTest.LoadTest.runLoadTest(LoadTest.java:163) ~[bin/:?]
LT  	at net.adoptopenjdk.loadTest.LoadTest.main(LoadTest.java:119) ~[bin/:?]
LT  23:36:30.767 - suite.getInventory().getInventoryFileRef(): openjdk.test.load/config/inventories/lambdasAndStreams/lambda.xml
LT  23:36:30.768 - suite.isCreateDump() : false
LT  23:36:30.768 - Test failed
LT    Failure num.  = 1
LT    Test number   = 12
LT    Test details  = 'JUnit[net.adoptopenjdk.test.streams.TestParallelStreamOperations]'
LT    Suite number  = 0
LT    Thread number = 5
LT  >>> Captured test output >>>

@JasonFengJ9 (Member)

https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_sanity.system_ppc64le_linux_xl_Nightly/33

LT  00:40:40.209 - Uncaught exception: 
LT  java.lang.OutOfMemoryError: Java heap space
LT  	at net.adoptopenjdk.loadTest.LoadTestRunner.run(LoadTestRunner.java:309) ~[bin/:?]
LT  	at net.adoptopenjdk.loadTest.LoadTest.runLoadTest(LoadTest.java:163) ~[bin/:?]
LT  	at net.adoptopenjdk.loadTest.LoadTest.main(LoadTest.java:119) ~[bin/:?]
LT  00:40:54.318 - suite.getInventory().getInventoryFileRef(): openjdk.test.load/config/inventories/lambdasAndStreams/lambda.xml

@pshipton (Member, Author)

https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_sanity.system_ppc64le_linux_xl_OMR/9

@Mesbah-Alam can you please exclude this test, only for large heap builds if possible.

@dmitripivkine (Contributor)

Looks like this issue is not GC related. Changing label to comp:test.

@lumpfish (Contributor)

lumpfish commented Feb 8, 2021

These OOMs should no longer occur as a result of the changes made for adoptium/aqa-systemtest#379.

pshipton closed this as completed Feb 8, 2021