
JBTM-3971 Set the proper state while recovering timed out LRAs #27

Open

wants to merge 2 commits into main from JBTM-3966_fix

Conversation

@marcosgopen (Member) commented Jan 9, 2025

https://issues.redhat.com/browse/JBTM-3971

When short LRAs time out while the server is down, the abortLRA method is not called. When the server restarts, the scheduler is not restored because the LRA has already timed out, so its status remains Cancelling indefinitely.

The second commit is related to https://issues.redhat.com/browse/JBTM-3966

The normal scheduler for cancellation of short LRAs: https://github.com/jbosstm/lra/blob/main/coordinator/src/main/java/io/narayana/lra/coordinator/domain/model/LongRunningAction.java#L1105

The scheduler sets the status to Cancelled when the participant lists are empty: https://github.com/jbosstm/lra/blob/main/coordinator/src/main/java/io/narayana/lra/coordinator/domain/model/LongRunningAction.java#L1189

When a short LRA times out while the server is down, that method is not called, and when the LRA is restored its status goes to Cancelling: https://github.com/jbosstm/lra/blob/main/coordinator/src/main/java/io/narayana/lra/coordinator/domain/model/LongRunningAction.java#L322

Eventually, when the short LRA is cancelled, the status remains Cancelling even though the LRA is considered finished: https://github.com/jbosstm/lra/blob/main/coordinator/src/main/java/io/narayana/lra/coordinator/domain/model/LongRunningAction.java#L502-L506

Please note that if an LRA is in Cancelling status but all the lists are empty, it is considered finished: https://github.com/jbosstm/lra/blob/main/coordinator/src/main/java/io/narayana/lra/coordinator/domain/model/LongRunningAction.java#L427-L441
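The state decision described above can be sketched in isolation. The following is a minimal, self-contained illustration of the recovery logic this PR changes; the enum and the size helper are stand-ins for the real io.narayana.lra classes, not the actual coordinator code:

```java
import java.util.List;

// Sketch of the recovery-state decision: the enum and size helper below are
// stand-ins for the real io.narayana.lra classes.
public class RecoveryStateSketch {
    enum LRAStatus { Active, Cancelling, Cancelled }

    // An Active LRA whose deadline passed while the server was down moves
    // straight to Cancelled when there are no participants left to notify,
    // otherwise to Cancelling so recovery can drive the cancellation.
    static LRAStatus restoredStatus(LRAStatus status,
                                    List<?> pendingList,
                                    List<?> preparedList,
                                    List<?> heuristicList) {
        if (status == LRAStatus.Active) {
            if (size(pendingList) == 0 && size(preparedList) == 0 && size(heuristicList) == 0) {
                return LRAStatus.Cancelled; // nothing to cancel: finished
            }
            return LRAStatus.Cancelling;
        }
        return status;
    }

    private static int size(List<?> list) {
        return list == null ? 0 : list.size();
    }

    public static void main(String[] args) {
        System.out.println(restoredStatus(LRAStatus.Active, null, null, null));               // Cancelled
        System.out.println(restoredStatus(LRAStatus.Active, List.of("participant"), null, null)); // Cancelling
    }
}
```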

@jbosstm-bot

LRA profile tests failed (https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=LRA,jdk=openJDK17,label=jnlp-agent/98/): LRA Test failed with failures in arq profile

@jbosstm-bot

JACOCO profile tests failed (https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=JACOCO,jdk=openJDK17,label=jnlp-agent/98/): LRA Test failed with failures in arq profile

@mmusgrov (Member) previously approved these changes and left a comment, Jan 9, 2025

Nice fix - the PR is still marked draft but I'm approving it anyway.

[Edit] I asked for a re-review. Please let me know when you are happy with your fix.

@jbosstm-bot

LRA profile tests failed (https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=LRA,jdk=openJDK17,label=jnlp-agent/99/): LRA Test failed with failures in arq profile

.put(null);

Assert.assertEquals("LRA participant action", 200, response.getStatus());
Long SHORTER_TIMEOUT = 1000L;
Member Author:

A shorter timeout (1 second) is used here so that the short LRA times out while the server is down.


@mmusgrov mmusgrov self-requested a review January 9, 2025 17:20


for (Map.Entry<String, String> entry : containerDeploymentMap.entrySet()) {
stopContainer(entry.getKey(), entry.getValue());
}
clearRecoveryLogFromFS();
Member:

I understand this is cleanup after the server is stopped, so I'm fine with this moving. I would be tempted (in a follow-up) to see whether it's worth asserting a failure if something is still there, but this is just a thought, and maybe some tests would still expect some file-system records.

@marcosgopen (Member Author) Jan 10, 2025:

Good point Tom. I think we wanted this cleanup to prevent a test failure from contaminating the following tests and making it appear (e.g. in the CI) that multiple tests have failed.

Member:

Just to be clear, I see this topic as different from #27 (comment), but I will resolve this one.

Member:

Actually, I am not mandating that the fail checks be added, but to aid the readability of my review I will unresolve it; it does not need addressing IMO.

clearRecoveryLogFromObjectStore();
clearRecoveryLogFromFS();
Member:

I am OK with this being moved before the start of the container; it seems clean to me to do that. I would be tempted to fail the test if there was something in the object store after starting the new test (or assume there should be nothing and remove the call to clearRecoveryLogFromObjectStore), but these are remarks rather than requests for change.

Member Author:

Thanks Tom, I think this is related to my previous comment. In the CI we run the tests with the --fail-at-end flag, so we want to make sure that one failure does not impact the following tests.

Member:

I guess part of what I am saying is that I can't imagine a way that clearRecoveryLogFromObjectStore could find a transaction, and if it did, something in Narayana is broken and needs fixing, so a fail() would seem appropriate.

Member:

Anyway, it was a remark so I will resolve this

Member:

Actually, I am not mandating that the fail checks be added, but to aid the readability of my review I will unresolve it; it does not need addressing IMO.

@marcosgopen marcosgopen marked this pull request as ready for review January 10, 2025 09:42
@@ -319,7 +319,12 @@ public boolean restore_state(InputObjectState os, int ot) {
}

if (status == LRAStatus.Active) {
status = LRAStatus.Cancelling; // transition from Active to Cancelling
if (getSize(pendingList) == 0 && getSize(preparedList) == 0 && getSize(heuristicList) == 0) {
updateState(LRAStatus.Cancelled);
Member:

Please can you share the pre-crash condition for the LRA such that (getSize(pendingList) == 0 && getSize(preparedList) == 0 && getSize(heuristicList) == 0) holds but the status was Active? Is this basically an LRA that had been created with no participants registered, so we are just moving it to Cancelled because the TTL has expired? I guess so, but if so it may be worth a comment, because it at least made me curious to know how this could happen.

@marcosgopen (Member Author) Jan 10, 2025:

That is the case tested here: https://github.com/jbosstm/lra/pull/27/files#diff-1a03090f8ce8ff7635f6bc880a133ff1d5f01a56955be2da86c33199963bf879R151
You are right, it is the case where no participants are registered and the LRA is active. For instance, if we create a short LRA and the server crashes before any participants are registered, then when the server restarts the LRA has already expired.

Member:

Thank you, I think it might be worth a comment in the code just to mention that, as it's not immediately obvious, but I suppose it's not essential. Thank you for the answer here, though :)

Member:

@marcosgopen please can I ask you to confirm whether this change is related in some way to JBTM-3966, or is it unrelated?

Member:

Actually, I think the two commits in the pull request should be prefixed with their JIRA issue numbers; then this would be clearer. Is that OK with you, please?

Member:

And would the second commit be OK in its own pull request, or does it require the first commit? If so, I think some linkage between https://issues.redhat.com/browse/JBTM-3966 and https://issues.redhat.com/browse/JBTM-3971 should be made in the JIRAs. I am not sure which, but maybe JBTM-3966 should be marked as blocked by JBTM-3971?

Member:

I think from the JIRA linkage it can now be understood that JBTM-3971 really should be part of the same pull request (or merged first, so that JBTM-3966 will be more predictable) - thank you!


// The LRA transaction could have been started via lraClient but it is useful to
Member:

This statement is a little confusing because on the new line 155 lraClient is being used

Member:

Actually, I see where it was moved from in the old diff, and my thought is that the intention of the test before was to use the annotated client directly. I am not -1 on changing to using lraClient (as long as we have tests for the filters elsewhere), but it seems this comment is no longer appropriate?

Member Author:

Good catch, I'll remove that comment.
As I mentioned above, I will restore (with small changes) the previous test that tests the annotated client.

Member:

OK, thank you!

// Verifies that the resource was notified that the LRA finished
String listenerStatus = getStatusFromListener(lraListenerURI);
Assert.assertTrue(String.format("LRA %s should have cancelled", shortLRA.toString()),
status == null || status == LRAStatus.Cancelled);
Member:

Is it possible for status to be null? What is the condition for that to happen, please?

@marcosgopen (Member Author) Jan 10, 2025:

It might be null only if the server has not recovered the LRA yet. Since we have the wait() call before that, we could harden the test so that we verify recovery has finished and the status is not null.

Member:

I think a "status == null" should be considered a failure if it means the server should know something about the LRA but does not.
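The hardened assertion being discussed can be sketched in plain Java; assertCancelled below is a hypothetical helper standing in for the JUnit assertions in the real test, not existing test-suite code:

```java
// Hardened check as discussed: treat a null status as its own failure, with
// a message distinguishing "not recovered" from "wrong state".
public class StatusAssertSketch {
    enum LRAStatus { Cancelling, Cancelled }

    // Hypothetical helper: fails when the status is null (LRA not recovered)
    // or when the LRA finished in any state other than Cancelled.
    static void assertCancelled(LRAStatus status, String lraId) {
        if (status == null) {
            throw new AssertionError("LRA " + lraId + " was not recovered (status is null)");
        }
        if (status != LRAStatus.Cancelled) {
            throw new AssertionError("LRA " + lraId + " should have cancelled but was " + status);
        }
    }

    public static void main(String[] args) {
        assertCancelled(LRAStatus.Cancelled, "http://example/lra/1"); // passes silently
        try {
            assertCancelled(null, "http://example/lra/2");
        } catch (AssertionError expected) {
            System.out.println("null status correctly fails");
        }
    }
}
```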

@@ -247,7 +213,6 @@ public void lraCoordinatorRecoveryTwoLRAs(@ArquillianResource @OperateOnDeployme
}

lraClient.closeLRA(longLRA);
lraClient.closeLRA(shortLRA);
Member:

Please can you mention why removing this call is definitely fine now?

Member Author:

There was a mistake in the test: the waiting time (doWait) was not long enough to trigger the short LRA timeout, so the LRA needed to be closed manually:
SHORT_TIMEOUT is 10,000 ms
LRAListener.LRA_SHORT_TIMELIMIT is 5L (seconds)
(LRAListener.LRA_SHORT_TIMELIMIT + 1L) * 1000 = 6,000 ms

Now the short LRA is already closed at the end of the test, which I think makes more sense.
BTW, calling closeLRA on a cancelled LRA returns a 412 (Precondition Failed) response.
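The arithmetic above can be checked directly; the constants below mirror the values quoted in this comment and are illustrative rather than copied from the test suite:

```java
// Verifies the timing arithmetic from the comment above: the old doWait of
// (LRA_SHORT_TIMELIMIT + 1) seconds ended before the 10 s SHORT_TIMEOUT,
// which is why the old test had to close the LRA manually.
public class TimeoutArithmetic {
    static final long SHORT_TIMEOUT_MS = 10_000L;  // shortLRA timeout (ms)
    static final long LRA_SHORT_TIMELIMIT = 5L;    // seconds, per LRAListener

    static long oldWaitMs() {
        return (LRA_SHORT_TIMELIMIT + 1L) * 1000;  // the old doWait value
    }

    public static void main(String[] args) {
        System.out.println(oldWaitMs());                    // 6000
        System.out.println(oldWaitMs() < SHORT_TIMEOUT_MS); // true: wait too short
    }
}
```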

Member:

This makes sense to me now. So if the LRA was Cancelled (rather than Cancelling) before these changes, this line would have caused a test failure.

Member Author:

Yes, I believe so.

@marcosgopen marcosgopen changed the title Extra check for ShortLRA JBTM-3971 Set the proper state while recovering timed ont LRAs Jan 10, 2025
@marcosgopen marcosgopen force-pushed the JBTM-3966_fix branch 2 times, most recently from e0009a4 to 28c8ded Compare January 10, 2025 10:17
@mmusgrov mmusgrov dismissed their stale review January 10, 2025 10:18

There have been a lot of changes since the version I approved and I'm not clear on why so many changes are needed.

@jbosstm-bot

LRA profile tests failed (https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=LRA,jdk=openJDK17,label=jnlp-agent/101/): LRA Test failed with failures in arq profile


if (shortStatus == LRAStatus.Cancelling) {
recover();
shortStatus = getStatus(shortLRA);
}
Member:

@marcosgopen please can you clarify why this block is a better approach than the previous block: https://github.com/jbosstm/lra/pull/27/files#diff-1a03090f8ce8ff7635f6bc880a133ff1d5f01a56955be2da86c33199963bf879L221-L227? They look somewhat similar to me.

@marcosgopen (Member Author) Jan 10, 2025:

After some recent changes (wildfly/wildfly#18499), recovery is now triggered by the server startup, so it should not need to be triggered manually. (Edited)

Member:

OK, that's clear then. I think I would say "should not need to be triggered manually", but it is now clearer why it's different. Thank you.

When short LRAs time out while the server is down, the abortLRA method is not called. When the server restarts, the scheduler is not restored because the LRA has already timed out, and the LRA status remains Cancelling or Closing indefinitely
@jbosstm-bot

LRA profile tests failed (https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=LRA,jdk=openJDK17,label=jnlp-agent/102/): LRA Test failed with failures in arq profile

Comment on lines +193 to +194
// wait almost until the LRA is timed out)
doWait((LRAListener.LRA_SHORT_TIMELIMIT * 1000) - 500);
Member:

Please can I check whether it is mandatory to wait here before restarting? If so, perhaps the comment could be expanded to clarify why?

Member Author:

Good idea, I'll expand my comment.

@@ -163,36 +186,19 @@ public void lraCoordinatorShortTimeoutLRA(@ArquillianResource @OperateOnDeployme

Assert.assertEquals("LRA participant action", 200, response.getStatus());

// The LRA transaction could have been started via lraClient but it is useful to
Member:

Please can I ask why this was moved to before the restart of the container? This might end up being a duplicate of the discussion at https://github.com/jbosstm/lra/pull/27/files#r1910297173, but I am not sure.



// LRA with short timeout should have timed out and cancelled
status = getStatus(new URI(lraId));

Member:

I understand why this is being removed - it's the same reason as https://github.com/jbosstm/lra/pull/27/files#r1910235210

Member:

But I am wondering if it might be useful to add the replacement you made there (or something similar). The doWait((LRAListener.LRA_SHORT_TIMELIMIT * 1000) - 500); (on line 194) is not long enough to time out the LRA on its own. Even though, combined with the time restartContainer(LRA_COORDINATOR_CONTAINER_QUALIFIER); (from line 196) takes, it is probably long enough to time out the LRA, it might not be. In that case the LRAStatus status = getStatus(new URI(lraId)); on (now) line 200 might still find the LRA Cancelling, and the assert on lines 202-203 might fail.
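One way to avoid the race described above is to poll for the expected status rather than sleep for a fixed period. The sketch below is a hypothetical helper (waitUntil is not part of the existing test suite); in the real test, the supplier would call getStatus and compare against Cancelled:

```java
import java.util.function.Supplier;

// Polling alternative to a fixed doWait: repeatedly evaluate a condition
// until it holds or an upper bound elapses. This tolerates variable restart
// times instead of racing against them.
public class AwaitSketch {
    static boolean waitUntil(Supplier<Boolean> condition, long timeoutMs, long pollMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (condition.get()) {
                return true;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return condition.get();
            }
        }
        return condition.get(); // one final check at the deadline
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Condition becomes true after ~200 ms; the poll loop picks it up
        // well inside the 5 s bound, keeping the overall wait short.
        boolean met = waitUntil(() -> System.currentTimeMillis() - start >= 200, 5_000, 50);
        System.out.println(met); // true
    }
}
```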


@marcosgopen marcosgopen changed the title JBTM-3971 Set the proper state while recovering timed ont LRAs JBTM-3971 Set the proper state while recovering timed out LRAs Jan 10, 2025
}
else {
updateState(LRAStatus.Cancelling);
}
Member:

There is a question of whether this new code could use the scheduleCancellation function somehow. As I understand it, abortLRA causes a locking issue, but scheduleCancellation would hopefully run in a different thread.
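The idea of handing the cancellation to a different thread can be sketched with a ScheduledExecutorService. This is illustrative only, with stand-in types; it is not the coordinator's actual scheduleCancellation implementation:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a scheduleCancellation-style approach: the state transition runs
// on a scheduler thread instead of inline, so the caller does not perform the
// cancellation while holding its own locks.
public class ScheduleCancellationSketch {
    enum LRAStatus { Cancelling, Cancelled }

    static LRAStatus cancelAfter(long delayMs) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicReference<LRAStatus> status = new AtomicReference<>(LRAStatus.Cancelling);
        try {
            // The transition to Cancelled happens on the scheduler thread;
            // this sketch blocks on get() only so it can return the result.
            scheduler.schedule(() -> status.set(LRAStatus.Cancelled), delayMs, TimeUnit.MILLISECONDS)
                     .get();
        } catch (Exception e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdown();
        }
        return status.get();
    }

    public static void main(String[] args) {
        System.out.println(cancelAfter(50)); // Cancelled
    }
}
```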

@mmusgrov (Member):

Please note that if a LRA is in cancelling status but all the lists are empty it is considered finished:

The spec [1] says that a participant can ask to be notified of the outcome of the LRA via the @AfterLRA annotation as long as the LRA hasn't reached one of the terminal states.

[1] https://download.eclipse.org/microprofile/microprofile-lra-2.0/microprofile-lra-spec-2.0.html#_quick_overview_of_annotations
[2] https://github.com/eclipse/microprofile-lra/blob/2.0/api/src/main/java/org/eclipse/microprofile/lra/annotation/AfterLRA.java#L37

@marcosgopen (Member Author):

I am putting this on hold while I investigate the issue further and address your comments.
