
Test_openjdk8_j9_sanity.system_x86-64_linux_xl LambdaLoadTest_ConcurrentScavenge_0 OOM #6475

Closed
pshipton opened this issue Jul 16, 2019 · 32 comments


@pshipton (Member)

https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_sanity.system_x86-64_linux_xl_Nightly/98

LT  06:46:01.051 - Completed 3.0%. Number of tests started=6 (+0)
LT  stderr JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2019/07/16 06:46:13 - please wait.
LT  stderr JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2019/07/16 06:46:13 - please wait.
@pshipton (Member, Author)

1TISIGINFO Dump Event "systhrow" (00040000) Detail "java/lang/OutOfMemoryError" "Java heap space" received

@amicic can somebody take a look at this please?

@amicic (Contributor)

amicic commented Jul 16, 2019

I was eventually able to reproduce this in a local environment. The OOM is due to excessive GC. Will investigate further.

@amicic (Contributor)

amicic commented Jul 22, 2019

Reproducibility is about 1/20.

It does not seem to be Concurrent Scavenger related. I could reproduce this even with the STW Scavenger, and in fact with the optthruput GC policy.

It will probably occur only in Java 8 and only with the XL JVM, because the heap is set to 512M and there is more stress with the XL JVM.

This is not even a recent regression; it would occur in Release 0.11, too.

I suspect that the test itself is broken. Normally it starts 6 concurrent test threads and they finish within 1-2 seconds, with very low memory (heap) requirements. However, occasionally one of the threads does not terminate, but keeps allocating and, more importantly, retaining a relatively large amount of objects (200-300MB). If only one such thread 'misbehaves', the test times out but still PASSES! However, if 2 such threads misbehave, the live set approximately doubles, the heap (512M) fills up, there is constant GCing, and OOM is thrown due to excessive GC.
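As a rough illustration of why two misbehaving threads are fatal while one is not (the 200-300MB retained-set figure is the estimate above; the 250MB midpoint and the class name are my assumptions):

```java
// Back-of-the-envelope sketch of the heap pressure described above.
// Figures are the rough estimates from this comment, not measured values.
public class HeapPressureSketch {
    public static void main(String[] args) {
        final int heapMb = 512;               // -Xmx for these XL JVM runs
        final int retainedPerBadThread = 250; // midpoint of the 200-300MB estimate

        for (int badThreads = 1; badThreads <= 2; badThreads++) {
            int liveSetMb = badThreads * retainedPerBadThread;
            System.out.printf("%d misbehaving thread(s): ~%dMB live in a %dMB heap%n",
                    badThreads, liveSetMb, heapMb);
        }
        // With one bad thread (~250MB) the heap copes; with two (~500MB) nearly
        // the entire 512MB heap is live, so the GC thrashes and OOM follows.
    }
}
```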

Here is the output when 2 of the 6 threads did not complete as normal:

LT 08:41:45.813 - Starting thread. Suite=0 thread=0
LT 08:41:45.814 - Starting thread. Suite=0 thread=1
LT 08:41:45.821 - Starting thread. Suite=0 thread=2
LT 08:41:45.822 - Starting thread. Suite=0 thread=3
LT 08:41:45.822 - Starting thread. Suite=0 thread=4
LT 08:41:45.823 - Starting thread. Suite=0 thread=5
LT 08:41:46.075 - Thread completed. Suite=0 thread=0
LT 08:41:46.077 - Thread completed. Suite=0 thread=4
LT 08:41:46.077 - Thread completed. Suite=0 thread=5
LT 08:41:46.133 - Thread completed. Suite=0 thread=3
LT 08:42:06.503 - Completed 3.0%. Number of tests started=6
LT stderr JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2019/07/22 08:42:26 - please wait.

I tried running with Hotspot, but for some reason I was not able to.
I think the test team should look at this next and try to reproduce with Hotspot. If this thread misbehaviour occurs there too, the test is broken; otherwise the VM team needs to investigate further.

@pshipton (Member, Author)

@Mesbah-Alam can you please help Aleks run the test with Hotspot.
@llxia @smlambert fyi

@Mesbah-Alam (Contributor)

Mesbah-Alam commented Jul 24, 2019

@amicic, what GC policy should be used while running the test on HotSpot? Or do you predict that the same issue will reproduce on HotSpot irrespective of the GC policy used?

This test is tagged to run on OpenJ9 only: https://github.com/AdoptOpenJDK/openjdk-tests/blob/865ba275482fe6076a51730ad1451b16ea5f3314/systemtest/lambdaLoadTest/playlist.xml#L116

It uses the following GC options: -Xgcpolicy:gencon -Xgc:concurrentScavenge -- which I believe will not work on HotSpot.

@amicic (Contributor)

amicic commented Jul 24, 2019

GC policy is irrelevant for either OpenJ9 or HotSpot; the defaults are OK.

If we see messages like the following (which occur in approximately 1 in 4 runs for OpenJ9)

LT 08:42:06.503 - Completed 3.0%. Number of tests started=6

with HotSpot, that would indicate that the test is somewhat broken/inconsistent.

Once it occurs, and lasts for some time (tens of seconds), in OpenJ9 it may lead to OOM (due to excessive GC). Not necessarily for HotSpot, since they may have different criteria for 'excessive GC'. Even in OpenJ9, these messages do not always lead to OOM.

So, looking at just the PASS/FAIL result of the test, we cannot determine whether the test has a problem with HotSpot too. We have to look for occurrences of the above log message.

@Mesbah-Alam (Contributor)

@amicic - thanks for the clarification.

> It will probably occur only in Java 8 and only with XL JVM, because heap is set to 512M and there is more stress with XL JVM

There is no HotSpot Large Heap Linux x64 JDK8 at Adopt : https://adoptopenjdk.net/nightly.html?variant=openjdk8&jvmVariant=hotspot

Would testing with regular HotSpot Linux x64 JDK8 suffice?

@amicic (Contributor)

amicic commented Jul 24, 2019

It does not have to be the Large Heap variant (which I believe could be enabled by an option, from the same VM), although the Large Heap variant is more likely to cause OOM (for OpenJ9, anyway). But again, since we are not looking for the OOM but for the log messages, XL or not is irrelevant.

@Mesbah-Alam (Contributor)

Mesbah-Alam commented Jul 24, 2019

Hi @amicic,

20x Grinder on Hotspot does not reproduce the issue: https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder-systemtest/179/tapResults/

40x Grinder on HotSpot also does not reproduce the issue: https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder-systemtest/180/tapResults/

Ran with JDK 8, Linux x64 (regular heap size) in "NoOptions" mode. Is there any other option we could try to reproduce the issue?

@amicic (Contributor)

amicic commented Jul 25, 2019

Indeed, none of these 60 runs had that intermediate 'Completed...' message.

One difference was that the OpenJ9 runs used 6 threads and these used only 2.

Could you repeat it with 6 threads?

@amicic (Contributor)

amicic commented Jul 25, 2019

For clarification, I'm again showing the problematic output, but with the test actually passing (occurs 1 in 4 runs with OpenJ9):

15:05:44.899 - Starting thread. Suite=0 thread=0
15:05:44.900 - Starting thread. Suite=0 thread=1
15:05:44.900 - Starting thread. Suite=0 thread=2
15:05:44.900 - Starting thread. Suite=0 thread=3
15:05:44.901 - Starting thread. Suite=0 thread=4
15:05:44.901 - Starting thread. Suite=0 thread=5
15:05:45.055 - Thread completed. Suite=0 thread=5
15:05:45.074 - Thread completed. Suite=0 thread=2
15:05:45.075 - Thread completed. Suite=0 thread=0
15:05:45.117 - Thread completed. Suite=0 thread=4
15:05:45.141 - Thread completed. Suite=0 thread=3
15:06:04.966 - Completed 3.0%. Number of tests started=6
15:06:10.599 - Thread completed. Suite=0 thread=1
15:06:10.626 - Test stopped due to reaching runtime limit
15:06:10.626 - Load test completed
15:06:10.626 -   Ran     : 6
15:06:10.626 -   Passed  : 6
15:06:10.626 -   Failed  : 0
15:06:10.627 -   Result  : PASSED

This is normal output:

09:16:26.857 - Starting thread. Suite=0 thread=0
09:16:26.858 - Starting thread. Suite=0 thread=1
09:16:26.858 - Starting thread. Suite=0 thread=2
09:16:26.858 - Starting thread. Suite=0 thread=3
09:16:26.859 - Starting thread. Suite=0 thread=4
09:16:26.859 - Starting thread. Suite=0 thread=5
09:16:27.114 - Thread completed. Suite=0 thread=2
09:16:27.114 - Thread completed. Suite=0 thread=1
09:16:27.146 - Thread completed. Suite=0 thread=5
09:16:27.260 - Thread completed. Suite=0 thread=3
09:16:27.370 - Thread completed. Suite=0 thread=0
09:16:27.416 - Thread completed. Suite=0 thread=4
09:16:27.460 - Test stopped due to reaching runtime limit
09:16:27.461 - Load test completed
09:16:27.461 -   Ran     : 6
09:16:27.461 -   Passed  : 6
09:16:27.461 -   Failed  : 0
09:16:27.461 -   Result  : PASSED

Besides the extra message, one can see that the 'bad' test (on one of the six threads) took an extra 25 seconds to complete, when it normally takes only about 1 second.

@Mesbah-Alam (Contributor)

> One difference was that OpenJ9 runs were with 6 threads and these were with only 2.

Hi @amicic,

Here's how the test sets the number of threads it will use:

.setSuiteThreadCount(cpuCount - 2, 2) // Leave 1 cpu for the JIT, 1 cpu for GC and set min 2

Where, cpuCount = Runtime.getRuntime().availableProcessors()
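A minimal sketch of that calculation (the class and method names here are hypothetical wrappers; only the formula comes from the test source quoted above):

```java
// Hypothetical sketch of the thread-count formula quoted above:
// reserve one CPU for the JIT and one for GC, with a floor of 2 threads.
public class ThreadCountSketch {
    static int suiteThreadCount(int cpuCount) {
        return Math.max(cpuCount - 2, 2);
    }

    public static void main(String[] args) {
        System.out.println(suiteThreadCount(8)); // prints 6
        System.out.println(suiteThreadCount(4)); // prints 2 (the minimum)
    }
}
```

This would explain the earlier discrepancy: an 8-CPU machine runs 6 test threads, while a machine with 4 or fewer CPUs hits the floor of 2.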

Could you please share the name of the machine on which you were able to reproduce the error? Maybe I should use the same machine for the HotSpot runs too.

@amicic (Contributor)

amicic commented Jul 25, 2019

I was using my personal cloud (Fyre) instance, which had 8 threads.

@Mesbah-Alam (Contributor)

Running the test on an internal machine with 8 CPUs produced results where we do see similar unusual behaviour, with 1 or 2 threads taking an extra 20 seconds:

14:03:49  LT  11:03:47.136 - Thread completed. Suite=0 thread=0
14:03:49  LT  11:03:47.155 - Thread completed. Suite=0 thread=5
14:03:49  LT  11:03:47.312 - Thread completed. Suite=0 thread=2
14:03:49  LT  11:03:47.452 - Thread completed. Suite=0 thread=4
14:04:14  LT  11:04:07.202 - Completed 3.0%. Number of tests started=6
14:04:18  LT  11:04:17.407 - Thread completed. Suite=0 thread=3
14:04:18  LT  11:04:17.458 - Thread completed. Suite=0 thread=1
14:04:18  LT  11:04:17.496 - Test stopped due to reaching runtime limit
14:04:18  LT  11:04:17.497 - Load test completed

@Mesbah-Alam (Contributor)

Mesbah-Alam commented Jul 29, 2019

Copying @amicic's comments from Slack:

As for the original failure, a quick workaround is to provide -Xmx1G to avoid the OOMs.

The inconsistency could possibly be explained by .setSuiteRandomSelection(); // Randomly pick the next test each time

Perhaps not all tests are run every time, and some genuinely need more time and have a larger live set (more heap required); although with 15 or so tests in the list and 200 test runs (as per setSuiteNumTests(200)), likely all are run all the time.
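That last point is easy to sanity-check. Assuming ~15 tests and 200 uniform random selections (both figures from the comment above), the chance of any test being skipped entirely is negligible:

```java
// Probability that a given test is never selected in 200 uniform random
// picks from an inventory of 15 tests: (14/15)^200.
public class SelectionOdds {
    public static void main(String[] args) {
        double pMissOne = Math.pow(14.0 / 15.0, 200);
        double pMissAny = 15 * pMissOne; // union bound over the 15 tests
        System.out.printf("P(a given test never runs)  = %.1e%n", pMissOne); // ~1.0e-06
        System.out.printf("P(some test never runs)    <= %.1e%n", pMissAny); // ~1.5e-05
    }
}
```

So with setSuiteNumTests(200), every test in the inventory is almost certainly exercised in every run, which supports the "likely all are run all the time" conclusion.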

The fact that we don't see a problem with 2 test threads but we do with 6 is also probably relevant. It somewhat suggests a race condition in the test (races are generally more likely with more threads).

@Mesbah-Alam (Contributor)

Hi @pshipton, could you please share your thoughts on whether it is worthwhile pursuing this investigation further? (Or should we close this issue by setting -Xmx1G to avoid the OOMs?)

@pshipton (Member, Author)

@Mesbah-Alam it seems worth more investigation in order to have a robust test. Were you able to reproduce the problem with HotSpot using 8 CPUs? Your comment doesn't say.

Hard-coding a bigger heap size doesn't seem like a fix, but a hack which may not always work. That said, it does make sense that more threads run more concurrent tests, which has a bigger memory requirement. Perhaps the test should limit the number of threads it creates.

Can you get a core file for the OOM failure, and look at it with Eclipse MAT to see what is consuming all the heap space?

@pshipton (Member, Author)

And btw, when testing with HotSpot to see whether the issue can be reproduced, -Xmx512m should be used.

@pshipton (Member, Author)

pshipton commented Dec 6, 2019

A 5x grinder on ppc64le_linux_xl failed 3/5
https://ci.eclipse.org/openj9/job/Grinder/563/

@JasonFengJ9 (Member)

https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_sanity.system_ppc64le_linux_xl_Nightly/18/tapResults/

LT  23:36:11.072 - First failure detected by thread: load-5. Not creating dumps as no dump generation is requested for this load test
LT  23:36:09.041 - Uncaught exception: 
LT  java.lang.OutOfMemoryError: Java heap space
LT  	at java.lang.String.<init>(String.java:820) ~[?:1.8.0_242-internal]
LT  	at java.lang.String.substring(String.java:2648) ~[?:1.8.0_242-internal]
LT  	at sun.util.resources.LocaleData$LocaleDataResourceBundleControl.toBundleName(LocaleData.java:273) ~[?:1.8.0_242-internal]
LT  	at java.util.ResourceBundle$Control.newBundle(ResourceBundle.java:2651) ~[?:1.8.0_242-internal]
LT  	at java.util.ResourceBundle.loadBundle(ResourceBundle.java:1518) ~[?:1.8.0_242-internal]
LT  	at java.util.ResourceBundle.findBundle(ResourceBundle.java:1482) ~[?:1.8.0_242-internal]
LT  	at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1370) ~[?:1.8.0_242-internal]
LT  	at java.util.ResourceBundle.getBundle(ResourceBundle.java:899) ~[?:1.8.0_242-internal]
LT  	at sun.util.resources.LocaleData$1.run(LocaleData.java:167) ~[?:1.8.0_242-internal]
LT  	at sun.util.resources.LocaleData$1.run(LocaleData.java:163) ~[?:1.8.0_242-internal]
LT  	at java.security.AccessController.doPrivileged(AccessController.java:678) ~[?:1.8.0_242-internal]
LT  	at sun.util.resources.LocaleData.getBundle(LocaleData.java:163) ~[?:1.8.0_242-internal]
LT  	at sun.util.resources.LocaleData.getNumberFormatData(LocaleData.java:159) ~[?:1.8.0_242-internal]
LT  	at sun.util.locale.provider.LocaleResources.getDecimalFormatSymbolsData(LocaleResources.java:168) ~[?:1.8.0_242-internal]
LT  	at java.text.DecimalFormatSymbols.initialize(DecimalFormatSymbols.java:616) ~[?:1.8.0_242-internal]
LT  	at java.text.DecimalFormatSymbols.<init>(DecimalFormatSymbols.java:113) ~[?:1.8.0_242-internal]
LT  	at sun.util.locale.provider.DecimalFormatSymbolsProviderImpl.getInstance(DecimalFormatSymbolsProviderImpl.java:85) ~[?:1.8.0_242-internal]
LT  	at java.text.DecimalFormatSymbols.getInstance(DecimalFormatSymbols.java:180) ~[?:1.8.0_242-internal]
LT  	at java.util.Formatter.getZero(Formatter.java:2283) ~[?:1.8.0_242-internal]
LT  	at java.util.Formatter.<init>(Formatter.java:1892) ~[?:1.8.0_242-internal]
LT  	at java.util.Formatter.<init>(Formatter.java:1914) ~[?:1.8.0_242-internal]
LT  	at java.lang.String.format(String.java:4174) ~[?:1.8.0_242-internal]
LT  	at net.adoptopenjdk.loadTest.LoadTestRunner.run(LoadTestRunner.java:311) ~[bin/:?]
LT  	at net.adoptopenjdk.loadTest.LoadTest.runLoadTest(LoadTest.java:163) ~[bin/:?]
LT  	at net.adoptopenjdk.loadTest.LoadTest.main(LoadTest.java:119) ~[bin/:?]
LT  23:36:30.767 - suite.getInventory().getInventoryFileRef(): openjdk.test.load/config/inventories/lambdasAndStreams/lambda.xml
LT  23:36:30.768 - suite.isCreateDump() : false
LT  23:36:30.768 - Test failed
LT    Failure num.  = 1
LT    Test number   = 12
LT    Test details  = 'JUnit[net.adoptopenjdk.test.streams.TestParallelStreamOperations]'
LT    Suite number  = 0
LT    Thread number = 5
LT  >>> Captured test output >>>

@JasonFengJ9 (Member)

https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_sanity.system_ppc64le_linux_xl_Nightly/33

LT  00:40:40.209 - Uncaught exception: 
LT  java.lang.OutOfMemoryError: Java heap space
LT  	at net.adoptopenjdk.loadTest.LoadTestRunner.run(LoadTestRunner.java:309) ~[bin/:?]
LT  	at net.adoptopenjdk.loadTest.LoadTest.runLoadTest(LoadTest.java:163) ~[bin/:?]
LT  	at net.adoptopenjdk.loadTest.LoadTest.main(LoadTest.java:119) ~[bin/:?]
LT  00:40:54.318 - suite.getInventory().getInventoryFileRef(): openjdk.test.load/config/inventories/lambdasAndStreams/lambda.xml

@pshipton (Member, Author)

https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_sanity.system_ppc64le_linux_xl_OMR/9

@Mesbah-Alam can you please exclude this test, only for large heap builds if possible.

@dmitripivkine (Contributor)

Looks like this issue is not GC related. Changing label to comp:test.

@lumpfish (Contributor)

lumpfish commented Feb 8, 2021

These OOMs should no longer occur as a result of the changes made for adoptium/aqa-systemtest#379.

pshipton closed this as completed Feb 8, 2021