-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance impact on Liberty InstantOn server startup time when enable semeruCloudCompiler #73
Comments
@tam512 Could you please post a similar for x86 (Azure)? From the Power log we see that more than 1 second is spent between these events:
I wonder if In between those two Liberty events there is a (successful) attempt to connect to the server. It's not clear to me whether that's creating a problem. |
Log from AKS
|
So, the server configuration update takes 1078 ms on Power and only 32 ms on x86. |
On x86 OCP 4.14 cluster, I also see this performance issue. Here is partial log
|
Running in a Microk8s Kubernetes cluster with LibertyKernel 24.0.0.2 and AcmeAirEE8 and InstantON and OpenJ9 Java21 private build and OLO I get
so I cannot reproduce the issue. |
@mpirvu noticed a discrepancy in the logs; the timestamps line up with the reported times on Power, but not on x86. On Power:
Time timestamps go from For x86:
The timestamps show 70ms elapsed but the reported times are similar to Power. Not exactly sure how to reconcile this. Is it possible for the log lines to be internally buffered and therefore delayed on x86? 70ms elapsed seems too fast for all the post-restore activity to have happened. @tjwatson any insight? |
The reported time looks accurate, regardless of the timestamps on the log, from what I can tell. The reported time is calculated by Liberty by taking the current time and subtracting the timestamp in I can see from the CRIU restore log that the process is restored in about 300 ms, so the remaining time must be taken by the JVM and Liberty doing post-restore activities. I would guess that latency between the application JVM and compilation server is the significant factor here, but where that latency is coming from isn't clear to me. As far as I can tell the application JVM and cloud compiler containers are actually on the same OCP worker, and therefore the same physical machine, so there is no real network between the two. |
@ymanton how does the log look like when JITServer is not enabled? |
Sorry for the late reply. I would attribute this to the fact that the |
@tam512 If I gave you a base liberty image with a customized JDK would you be able to create the Liberty+Apps image to try in this OpenShift environment? |
@mpirvu yes I can build a new app image with your customized Liberty image |
A new log from @tam512 shows
processOptionsPostRestore took only 8 ms so this rules out the SSL initialization (keys ans such) that is done once post restore. |
More instrumentation shows that the vast majority of the delay comes from establishing a typical TCP connection to the server.
|
Some new data: First, here's the breakdown of the compilation activity between when the process is restored and when the server reports startup is complete: InstantOn + Cloud Compiler:
InstantOn only
The slow connection time is reflected in the first compilation, but it looks like all subsequent non-AOT compilations are also slower, which suggests that all communication between the JVM and CC is slower. There are also far fewer AOT loads, which I assume is because they're stuck in the queue while the compilation threads on the JVM are waiting on the CC. Strangely enough, communication between the JVM container and CC container is not particularly slow. I verified this by running two netcat instances in the containers and timing the communication between them. Communication between the JVM and CC is also not particularly slow in the same containers. To verify this I ran The above suggests that the restored JVM in particular is the only one suffering from this problem, but only on this particular OCP instance. OCP instances on other clouds, and even other OCP instances on this cloud don't show the problem. I'll continue to try to isolate the problem. |
Another finding, adding The option shuts down the compiler and prevents connections to the remote compiler, and also throws out all of the previously compiled code. Slower startup is to be expected, but the fact that it's faster than with remote compilation enabled is definitely not expected. Remote compilation is somehow slowing things down beyond just the effects of fewer JIT compiled and AOT loaded methods. |
This problem appears to be independent of InstantOn and maybe even Semeru. If we disable InstantOn and start the server conventionally we still see the same 1-second connection time. With remote compilation once the compiler is idle we close the connection and later re-open if required, and here I can see that these subsequent connections are much faster, so the issue seems to be specifically tied to the first connection. I would suspect that there is probably nothing specific to Semeru here, and that any such first connection from any other program will show the same problem, but I would have to find a way to test that theory.
|
Test InstantOn on Power9 OCP 4.14, when semeruCloudCompiler is not enabled, server startup time is 0.290 seconds, when semeruCloudCompiler is enabled, server startup time is 1.312 seconds
The text was updated successfully, but these errors were encountered: