Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JBTM-3966 fix crash tests #26

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

marcosgopen
Copy link
Member

@marcosgopen marcosgopen commented Dec 23, 2024

https://issues.redhat.com/browse/JBTM-3966

Recovery is now triggered during the recovery service startup.

@jbosstm-bot
Copy link

@jbosstm-bot
Copy link

@jbosstm-bot
Copy link

@jbosstm-bot
Copy link

LRA profile tests failed (https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=LRA,jdk=openJDK17,label=jnlp-agent/92/): LRA Test failed with failures in arq profile

@@ -269,28 +231,6 @@ private void doWait(long millis) {
}
}

private int recover() {
Copy link
Member Author

@marcosgopen marcosgopen Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The recovery is triggered by the WildFly startup

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we test on a different container?

Copy link
Member Author

@marcosgopen marcosgopen Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point @mmusgrov ! Would you rather keep the recover method for different containers testing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just restored the recover method

@jbosstm-bot
Copy link

@jbosstm-bot
Copy link

@jbosstm-bot
Copy link

@jbosstm-bot
Copy link

@jbosstm-bot
Copy link

@jbosstm-bot
Copy link

@jbosstm-bot
Copy link

@jbosstm-bot
Copy link


assertNotNull("A new LRA should have been added to the object store before the JVM was halted.", shortLRA);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we aren't reading the lra from the filesystem (lraId = getFirstLRAFromFS();) this assert is no longer required.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I'll remove it.

recover();
}
}
doWait(SHORT_TIMEOUT + 5000L);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously the test waited for SHORT_TIMEOUT + 1 second, do you know why we now need to wait SHORT_TIMEOUT + 5 seconds - the reason I mention it is that too many waits makes the test suite run a lot slower and we have other tests that sleep so the cumulative effect becomes noticeable when running the full test suite.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. The reason was to keep the test more stable, but I can reset the wait as SHORT_TIMEOUT + 1 second, hopefully it is equally stable on CI.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we see failures then we can add such a workaround and raise the priority of replacing the sleeps with an appropriate byteman rule.

// Verifies that the resource was notified that the LRA finished
String listenerStatus = getStatusFromListener(lraListenerURI);
Assert.assertTrue(String.format("LRA %s should have cancelled", shortLRA.toString()),
status == null || status == LRAStatus.Cancelled || status == LRAStatus.Cancelling);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know why the LRA may still be in the cancelling state even though your have extended the wait by 4 seconds (doWait(SHORT_TIMEOUT + 5000L);).

@jbosstm-bot
Copy link

@jbosstm-bot
Copy link

@jbosstm-bot
Copy link

@jbosstm-bot
Copy link

@marcosgopen
Copy link
Member Author

I put back the Hold label because when the LRA timeout is shorter (1 sec) the LRA status will be 'CANCELLING' instead of 'CANCELLED' after the recovery. So I need to further investigate it.

@jbosstm-bot
Copy link

JACOCO profile tests failed (https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=JACOCO,jdk=openJDK17,label=jnlp-agent/96/): LRA Test failed with failures in arq profile

@jbosstm-bot
Copy link

LRA profile tests failed (https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=LRA,jdk=openJDK17,label=jnlp-agent/96/): LRA Test failed with failures in arq profile

@jbosstm-bot
Copy link

JACOCO profile tests failed (https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=JACOCO,jdk=openJDK17,label=jnlp-agent/97/): LRA Test failed with failures in arq profile

@jbosstm-bot
Copy link

LRA profile tests failed (https://ci-jenkins-csb-narayana.apps.ocp-c1.prod.psi.redhat.com/job/btny-pulls-lra/PROFILE=LRA,jdk=openJDK17,label=jnlp-agent/97/): LRA Test failed with failures in arq profile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants