-
Notifications
You must be signed in to change notification settings - Fork 156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SessionConstraint may cause a hang #681
Comments
Within CoreSystem, there is this loop that appears. The only difference is the suffix of the transactionId attribute of the query, which makes me believe that a single, same request got stuck.
Within Session, there is this loop that also bears the same transactionId:
What I understand from the Session log, and studying the code with reference to the stack trace: |
please provide
|
Thank you for looking into this. Apologies for taking a while to get back to you, but I have been getting increasingly busy at work. I made a heap dump before (.hprof), which seems to have the status of all threads. Would this do? https://www.sendspace.com/file/e4be8p The version of OpenAM is this repository's head, a 14.8.1 snapshot. |
Hi again. It has been 2 days and I think the patch has been working well. Closing as fixed by #682. Thank you. |
Describe the bug
In a new production system that we are setting up for our customer, AM appears to hang under specific conditions. The exact conditions for causing this hang isn't very clear, but I need to fix it. I will write what I know about it, in this report.
For at least now, I believe it is related to SessionConstraint by catching a never-ending loop in the logs, related to calls to refresh(false) within DestroyOldestSession. The transaction ID remains the same, so I believe it is the same request that is never ending. Since this runs in Kubenetes, I couldn't really figure out how to attach a debugger to it... so it is regretable that nothing more concrete can be determined yet.
I will explain why I think this is so, below. I will also paste the relevant log sections in another comment.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
OpenAM should enforce session constraints without issue.
Screenshots
![am-hang-infinite-session-load](https://private-user-images.githubusercontent.com/3446348/287494810-6825d99d-2405-4cd9-a0d4-ac106e2f6ad3.PNG?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzODMwMjIsIm5iZiI6MTczOTM4MjcyMiwicGF0aCI6Ii8zNDQ2MzQ4LzI4NzQ5NDgxMC02ODI1ZDk5ZC0yNDA1LTRjZDktYTBkNC1hYzEwNmUyZjZhZDMuUE5HP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDIxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAyMTJUMTc1MjAyWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9MDg1MGMyMmYwNDExZGRiOWFjYmEzMDhiN2VkYTJlYmExNjExODE4ODc4YmU1YTMwMzFkMzliZTY4ZGUyZTQzZCZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.YeRcdsXm9VigtyhSAyiL8Cxj3c6gJZ7kZTJqGgxQCCM)
Stack trace, in one of the threads that trigger never-ending reading of tokens:
In other case (not in the same run), there are too many instances of LDAPConnectionFactory etc, which is related to equally many instances of CachedConnectionPool:
![69K lots - LDAP](https://private-user-images.githubusercontent.com/3446348/287494881-294cbeb4-7329-463a-aa3e-69f76fa1441f.PNG?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzODMwMjIsIm5iZiI6MTczOTM4MjcyMiwicGF0aCI6Ii8zNDQ2MzQ4LzI4NzQ5NDg4MS0yOTRjYmViNC03MzI5LTQ2M2EtYWEzZS02OWY3NmZhMTQ0MWYuUE5HP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDIxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAyMTJUMTc1MjAyWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NWNmNzQxM2UyMTIxOGE4OTE3NmZmZjkwYzVmY2JlM2Q2MzkyMWYzN2E4MGRhNGM2ZTlmYTAxN2ZjMzVhODQ2OCZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.WnzH_-CuPYZeiBVaFYeMJJcSozJbpp5VmKIJ3gpMWM4)
Personally, I think that the memory issue is a symptom of the actual problem and is not really the problem itself. I don't really have strong evidence for this part, but I remember seeing connection pools getting created in the loop. If it happens fast enough, perhaps they are getting created faster than OpenDJ can handle.
Desktop (please complete the following information):
Additional context
I commented out this loop that was added here, like this. And the problem appears to have gone away:
However, I will only know for sure whether this matters, tomorrow. Usually, it happens very often whenever my customer is at work. I don't know why, but I assume it is caused by more sessions getting created and destroyed.
Within DestroyOldestAction, the action() method now looks like this:
There is a call to refresh(), as seen inside the stack trace above.
If we go back further (before 9f36814), this same method looked like this at some point:
The exception flow may occur if the token no longer exists. The Session.refresh() method would throw this exception:
This means that no token can be removed. However, the SessionConstraint class now has a loop, that will never exit if there is not at least 1 session destroyed.
I think that the original logic without the loop existed, because of this possibility that the session cannot be destroyed because it was already destroyed. In the original version, the creation of a new session would be blocked too.
But because of the code change by in 36677ca, it also now means that the session may always be created (if the loop in SessionConstraint does exit). I think, that could be permissible, if we can assume that the mechanism always works: the outcome is that the session is destroyed or that there is just no valid session to destroy. If not, then we may need to revert this change too.
But because there are no comments on why these changes were made in the first place, I do not know how to suggest a proper fix.
No longer blocking session creation, if old session cannot be destroyed: https://github.com/OpenIdentityPlatform/OpenAM/blame/02b7dd00e64302c7efe90a35fb43946c346dbb8c/openam-core/src/main/java/org/forgerock/openam/session/service/DestroyOldestAction.java#L82
Why exactly, do we need this loop? https://github.com/OpenIdentityPlatform/OpenAM/blame/28f00ee2f3601c5b2e22388f4b786abd5aafae78/openam-core/src/main/java/com/iplanet/dpro/session/service/SessionConstraint.java#L158
@vharseko would you be able to help explain why these changes were made in such a way, so that I can help to suggest a fix? Or would you already know what to do? Thank you.
The text was updated successfully, but these errors were encountered: