-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JBTM-3971 Set the proper state while recovering timed out LRAs #27
base: main
Are you sure you want to change the base?
Conversation
Started testing this pull request with JACOCO profile: https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=JACOCO,jdk=openJDK17,label=jnlp-agent/98/ |
Started testing this pull request with LRA profile: https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=LRA,jdk=openJDK17,label=jnlp-agent/98/ |
LRA profile tests failed (https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=LRA,jdk=openJDK17,label=jnlp-agent/98/): LRA Test failed with failures in arq profile |
JACOCO profile tests failed (https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=JACOCO,jdk=openJDK17,label=jnlp-agent/98/): LRA Test failed with failures in arq profile |
627193f
to
37676a8
Compare
Started testing this pull request with JACOCO profile: https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=JACOCO,jdk=openJDK17,label=jnlp-agent/99/ |
Started testing this pull request with LRA profile: https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=LRA,jdk=openJDK17,label=jnlp-agent/99/ |
JACOCO profile tests failed (https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=JACOCO,jdk=openJDK17,label=jnlp-agent/99/): narayana build failed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice fix - the PR is still marked draft but I'm approving it anyway.
[Edit] I asked for a re-review. Please let me know when you are happy with your fix.
LRA profile tests failed (https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=LRA,jdk=openJDK17,label=jnlp-agent/99/): LRA Test failed with failures in arq profile |
37676a8
to
a4b7d66
Compare
.put(null); | ||
|
||
Assert.assertEquals("LRA participant action", 200, response.getStatus()); | ||
Long SHORTER_TIMEOUT = 1000L; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using here a shorter timeout (1 sec) so that the shortLRA times out when the server is down.
Started testing this pull request with JACOCO profile: https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=JACOCO,jdk=openJDK17,label=jnlp-agent/100/ |
Started testing this pull request with LRA profile: https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=LRA,jdk=openJDK17,label=jnlp-agent/100/ |
JACOCO profile tests passed - Job complete https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=JACOCO,jdk=openJDK17,label=jnlp-agent/100/ |
LRA profile tests passed - Job complete https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=LRA,jdk=openJDK17,label=jnlp-agent/100/ |
test/crash/src/test/java/io/narayana/lra/arquillian/LRACoordinatorRecoveryIT.java
Outdated
Show resolved
Hide resolved
|
||
for (Map.Entry<String, String> entry : containerDeploymentMap.entrySet()) { | ||
stopContainer(entry.getKey(), entry.getValue()); | ||
} | ||
clearRecoveryLogFromFS(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand this is clean up after the server is stopped so it's fine with me for this to move. I would be tempted to see if it's worth (in a follow up) being able to assert a fail if there is something there but this is just a thought and maybe some tests would still expect some file system records
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point Tom. I think we wanted this clean up to prevent a test failure from contaminating following tests and appear (e.g. in the CI) that multiple tests have failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be clear I see this topic as different to #27 (comment) but I will resolve this one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I am not mandating the fail checks being added but to add readability of my review I will unresolve it but it does not need addressing IMO
clearRecoveryLogFromObjectStore(); | ||
clearRecoveryLogFromFS(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am OK with this being moved before the start of the container, it seems clean to me to do that. I would be tempted to fail the test if there was something in the objectstore after starting the new test (or make the assumption there should be nothing and remove the call to clearRecoveryLogFromObjectStore but these are remarks rather than requests for change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Tom, I think this is related to my previous comment. In the CI we have the --fail-at-end flag when running tests. So we want to make sure that a failure is not impacting the following tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess part of what I am saying though is that I can't imagine a way that clearRecoveryLogFromObjectStore should have found a transaction and if it did something in Narayana is broken and needs fixing so a "fail()" would seem appropriate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway, it was a remark so I will resolve this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I am not mandating the fail checks being added but to add readability of my review I will unresolve it but it does not need addressing IMO
coordinator/src/main/java/io/narayana/lra/coordinator/domain/model/LongRunningAction.java
Show resolved
Hide resolved
@@ -319,7 +319,12 @@ public boolean restore_state(InputObjectState os, int ot) { | |||
} | |||
|
|||
if (status == LRAStatus.Active) { | |||
status = LRAStatus.Cancelling; // transition from Active to Cancelling | |||
if (getSize(pendingList) == 0 && getSize(preparedList) == 0 && getSize(heuristicList) == 0) { | |||
updateState(LRAStatus.Cancelled); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please can you share the pre-crash condition for the LRA such that (getSize(pendingList) == 0 && getSize(preparedList) == 0 && getSize(heuristicList) == 0)
but the status was Active? Is this basically an LRA that had been created but no participants registered so we are just moving it to Cancelled as the TTL is expired. I guess so but if so maybe it's worth a comment because it at least made my curious to know how this could happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is the case tested here: https://github.com/jbosstm/lra/pull/27/files#diff-1a03090f8ce8ff7635f6bc880a133ff1d5f01a56955be2da86c33199963bf879R151
You are right, it is the case where no participants are registered and the LRA is active. For instance if we create a shortLRA and then the server crashes before the participants are registered, then when the server restarts the LRA is already expired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, I think it might be worth a comment in the code just to mention that as it's not immediately obvious but I suppose it's not essential to do that. Thank you for the answer on here, though :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marcosgopen please can I ask you to confirm if this change is related in some way to JBTM-3966 or is it unrelated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I think the two commits in the pull request should be prefixed with their numbers and then this would be clearer - is that OK with you, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And would the second commit be OK in it's own pull request or does it require the first commit? If so I think some linkage between https://issues.redhat.com/browse/JBTM-3966 and https://issues.redhat.com/browse/JBTM-3971 should be made in the JIRAs? I am not sure which but maybe JBTM-3966 should be marked as blocked by JBTM-3971?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think from the JIRA linkage now it can now be understood that JBTM-3971 really should be part of the same pull request (or merged first, so that JBTM-3966 will be more predictable) - thank you!
test/crash/src/test/java/io/narayana/lra/arquillian/LRACoordinatorRecoveryIT.java
Outdated
Show resolved
Hide resolved
|
||
// The LRA transaction could have been started via lraClient but it is useful to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This statement is a little confusing because on the new line 155 lraClient is being used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I see where it was moved from in the old diff and my thought is that the intention of the test before was to use the annotated client directly. I am not -1 to change to using lraClient (as long as we have tests for the filters elsewhere) but it seems like this comment is no longer appropriate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I'll remove that comment.
As I mentioned above I will restore (with little changes) the previous test testing he annotated client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thank you!
// Verifies that the resource was notified that the LRA finished | ||
String listenerStatus = getStatusFromListener(lraListenerURI); | ||
Assert.assertTrue(String.format("LRA %s should have cancelled", shortLRA.toString()), | ||
status == null || status == LRAStatus.Cancelled); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for status to be null? What is the condition for that to happen, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be null only in case the server has not recovered the LRA yet. Since we have the wait() call before that we could harden the test so that we test the recovery is finished and the LRA is not null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a "status == null" should be considered a failure if it means the server should know something about the LRA but does not
test/crash/src/test/java/io/narayana/lra/arquillian/LRACoordinatorRecoveryIT.java
Outdated
Show resolved
Hide resolved
@@ -247,7 +213,6 @@ public void lraCoordinatorRecoveryTwoLRAs(@ArquillianResource @OperateOnDeployme | |||
} | |||
|
|||
lraClient.closeLRA(longLRA); | |||
lraClient.closeLRA(shortLRA); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please can you mention why removing this call is definitely fine now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a mistake in the test: the waiting time (doWait) was not enough to trigger the shortLRA timeout and so it needed to be closed manually.
SHORT_TIMEOUT is 10.000
LRAListener.LRA_SHORT_TIMELIMIT is 5L
(LRAListener.LRA_SHORT_TIMELIMIT +1L ) * 1000 = 6000
Now the shortLRA is already closed at the end of the test. I think now it makes more sense.
BTW calling closeLRA to a cancelled LRA returns a 412 error response (precondition failed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me now. So if the LRA was cancelled (rather than cancelling) before these changes, this line would have caused a test failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I believe so.
e0009a4
to
28c8ded
Compare
There have been a lot of changes since the version I approved and I'm not clear on why so many changes are needed.
Started testing this pull request with JACOCO profile: https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=JACOCO,jdk=openJDK17,label=jnlp-agent/101/ |
Started testing this pull request with LRA profile: https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=LRA,jdk=openJDK17,label=jnlp-agent/101/ |
test/crash/src/test/java/io/narayana/lra/arquillian/LRACoordinatorRecoveryIT.java
Show resolved
Hide resolved
LRA profile tests failed (https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=LRA,jdk=openJDK17,label=jnlp-agent/101/): LRA Test failed with failures in arq profile |
JACOCO profile tests passed - Job complete https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=JACOCO,jdk=openJDK17,label=jnlp-agent/101/ |
if (shortStatus == LRAStatus.Cancelling) { | ||
recover(); | ||
shortStatus = getStatus(shortLRA); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marcosgopen please can you clarify why this block is a better approach than the previous block: https://github.com/jbosstm/lra/pull/27/files#diff-1a03090f8ce8ff7635f6bc880a133ff1d5f01a56955be2da86c33199963bf879L221-L227 they look to be somewhat similar to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some recent changes (wildfly/wildfly#18499) the recovery is now triggered by the server startup, so it should not need to be triggered manually. (Edited)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, that's clear then. I think I would say "should not need to be triggered manually" but that is clearer now to why it's different. Thank you
test/crash/src/test/java/io/narayana/lra/arquillian/LRACoordinatorRecoveryIT.java
Outdated
Show resolved
Hide resolved
When shortLRAs time out when the server is down the abortLRA method is not called. When the server restart the scheduler is not restored because already timed out and the LRA status keeps being cancelling or closing indefinitely
28c8ded
to
616d01f
Compare
Started testing this pull request with JACOCO profile: https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=JACOCO,jdk=openJDK17,label=jnlp-agent/102/ |
Started testing this pull request with LRA profile: https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=LRA,jdk=openJDK17,label=jnlp-agent/102/ |
JACOCO profile tests passed - Job complete https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=JACOCO,jdk=openJDK17,label=jnlp-agent/102/ |
LRA profile tests failed (https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=LRA,jdk=openJDK17,label=jnlp-agent/102/): LRA Test failed with failures in arq profile |
// wait almost until the LRA is timed out) | ||
doWait((LRAListener.LRA_SHORT_TIMELIMIT * 1000) - 500); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please can I check if it is mandatory to wait here before restarting? If so perhaps the comment could be expanded to clarify why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, I'll expand my comment
616d01f
to
b748699
Compare
@@ -163,36 +186,19 @@ public void lraCoordinatorShortTimeoutLRA(@ArquillianResource @OperateOnDeployme | |||
|
|||
Assert.assertEquals("LRA participant action", 200, response.getStatus()); | |||
|
|||
// The LRA transaction could have been started via lraClient but it is useful to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please can I ask why this was moved before the restart of the container? This might end up a duplicate discussion of https://github.com/jbosstm/lra/pull/27/files#r1910297173 but I am not sure
Started testing this pull request with JACOCO profile: https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=JACOCO,jdk=openJDK17,label=jnlp-agent/103/ |
Started testing this pull request with LRA profile: https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=LRA,jdk=openJDK17,label=jnlp-agent/103/ |
|
||
// LRA with short timeout should have timed out and cancelled | ||
status = getStatus(new URI(lraId)); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand why this is being removed - it's the same reason as https://github.com/jbosstm/lra/pull/27/files#r1910235210
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I am wondering if it might be useful to add the replacement you made there (or something similar) because the doWait((LRAListener.LRA_SHORT_TIMELIMIT * 1000) - 500);
(on line 194) is not long enough to timeout the LRA and even though when combined with the time restartContainer(LRA_COORDINATOR_CONTAINER_QUALIFIER);
(from 196) takes, it is probably long enough to timeout the LRA, it might not be and so the LRAStatus status = getStatus(new URI(lraId));
on (now) line 200 might find the LRA Cancelling still and so the assert on 202-203 might fail?
JACOCO profile tests passed - Job complete https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=JACOCO,jdk=openJDK17,label=jnlp-agent/103/ |
LRA profile tests passed - Job complete https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=LRA,jdk=openJDK17,label=jnlp-agent/103/ |
} | ||
else { | ||
updateState(LRAStatus.Cancelling); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a question if we can move this to new code to use the scheduleCancellation function somehow. As I understand it abortLRA causes a locking issue but scheduleCancellation would hopefully be in a different thread.
The spec [1] says that a participant can ask to be notified of the outcome of the LRA via the @AfterLRA annotation as long as the LRA hasn't reached one of the terminal states. [1] https://download.eclipse.org/microprofile/microprofile-lra-2.0/microprofile-lra-spec-2.0.html#_quick_overview_of_annotations |
I put this on hold while I investigate further the issue and address your comments. |
https://issues.redhat.com/browse/JBTM-3971
When shortLRAs time out when the server is down the abortLRA method is not called. When the server restarts, the scheduler is not restored because already timed out and the LRA status keeps being cancelling indefinitely
Second commit is related to https://issues.redhat.com/browse/JBTM-3966
normal scheduler for cancelletion of short LRAs: https://github.com/jbosstm/lra/blob/main/coordinator/src/main/java/io/narayana/lra/coordinator/domain/model/LongRunningAction.java#L1105
the scheduler sets the status to cancelled when lists are empty: https://github.com/jbosstm/lra/blob/main/coordinator/src/main/java/io/narayana/lra/coordinator/domain/model/LongRunningAction.java#L1189
when a short LRA times out while the server is down that method is not called, and when it is restored the status goes to cancelling: https://github.com/jbosstm/lra/blob/main/coordinator/src/main/java/io/narayana/lra/coordinator/domain/model/LongRunningAction.java#L322
Eventually when the shortLRA is cancelled the status keeps being cancelling when it is considered finished : https://github.com/jbosstm/lra/blob/main/coordinator/src/main/java/io/narayana/lra/coordinator/domain/model/LongRunningAction.java#L502-L506
Please note that if a LRA is in cancelling status but all the lists are empty it is considered finished: https://github.com/jbosstm/lra/blob/main/coordinator/src/main/java/io/narayana/lra/coordinator/domain/model/LongRunningAction.java#L427-L441