[cardano-testnet] Fix flaky test workspace cleanup and node port allocation #5875
Hmm, it seems the fix does not work. It is still failing with:
Most likely because epoch state logging gets in the way and gets cleaned up in
Failures because of slow port closing after
This happens on slow machines, like macOS runners.
New failure after fixing the previous two:
Nice work 👍. One suggestion regarding exponential backoff: waitForPortClosed will loop forever; we should probably fail at some point.
cardano-testnet/src/Testnet/Ping.hs (outdated)
writeIORef lastResult isOpen
when isOpen $ do
  -- repeat when port open
  MT.threadDelay interval
Thoughts on using an exponential backoff so this doesn't loop forever?
It's wrapped in timeout, so it will be interrupted eventually. I'll use retrying here instead; it will simplify the code.
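The point about the loop being bounded by an enclosing timeout can be illustrated with a minimal sketch. This is not the code from the PR: the name waitUntilClosed, its parameters, and the IO Bool probe standing in for a real port check are all assumptions made for illustration. It relies only on timeout from System.Timeout and threadDelay from base.

```haskell
import Control.Concurrent (threadDelay)
import Data.Maybe (isJust)
import System.Timeout (timeout)

-- | Hypothetical helper: poll @isOpen@ every @intervalMicros@ microseconds
-- until it reports False. Returns True if the port was seen closed before
-- @limitMicros@ elapsed, and False if the timeout interrupted the loop first.
waitUntilClosed :: Int -> Int -> IO Bool -> IO Bool
waitUntilClosed limitMicros intervalMicros isOpen =
  isJust <$> timeout limitMicros loop
  where
    -- The loop itself has no exit condition other than "port closed";
    -- the surrounding 'timeout' is what guarantees termination.
    loop = do
      open <- isOpen
      if open
        then threadDelay intervalMicros >> loop
        else pure ()
```

The loop never decides to give up on its own, so the caller has to check the Bool result and fail the test explicitly when the timeout fired.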
H.runFinallies $ workspace (workspaceName <> "-" <> show i) f

-- | Create a workspace directory which will exist for at least the duration of
-- the supplied block.
What do you mean by "supplied block"?
Oh, the comment doesn't make much sense. I copied the haddock as-is from https://hackage.haskell.org/package/hedgehog-extras-0.6.5.0/docs/Hedgehog-Extras-Test-Base.html#v:workspace
I'll fix it in hedgehog-extras and upstream the change after merging this PR.
f ws
when (os /= "mingw32" && maybeKeepWorkspace /= Just "1") $ do
  -- retry deleting the directory up to 10 times, 100ms apart
  let retryPolicy = R.constantDelay 100_000 <> R.limitRetries 10
Another spot for exponential backoff? I see retry has the exponentialBackoff combinator :)
I'm not sure we would gain much from exponential backoff. The operation is not expensive, so we're not burning CPU or anything.
I'd say exponential backoff makes sense for resource-intensive operations, like network communication.
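To put numbers on this trade-off, here is a small model of the two delay schedules. It is a hand-rolled sketch rather than the retry library's API: the function names are hypothetical, and the exponential schedule assumes the delay simply doubles on every retry with no upper cap.

```haskell
-- | Delays (in microseconds) under a constant policy:
-- the same pause before each of the @retries@ retries.
constantDelays :: Int -> Int -> [Int]
constantDelays delay retries = replicate retries delay

-- | Delays under an uncapped exponential backoff (an assumption of this
-- sketch): base * 2^n microseconds before retry n, counting n from 0.
exponentialDelays :: Int -> Int -> [Int]
exponentialDelays base retries = [base * 2 ^ n | n <- [0 .. retries - 1]]

-- | Total time spent sleeping if every attempt fails.
worstCaseWait :: [Int] -> Int
worstCaseWait = sum
```

With the values quoted above (100 ms, 10 retries), worstCaseWait (constantDelays 100000 10) is 1000000 microseconds, i.e. 1 s, whereas worstCaseWait (exponentialDelays 100000 10) is 102300000 microseconds, roughly 102 s. For a cheap local operation such as deleting a directory, the constant schedule keeps a failing test short.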
Right h -> pure h
Left e
  -- give up after 1000 attempts
  | n >= 1000 -> throwE e
Potential spot for exponential backoff
Actually, we're not retrying anything here. I'm assuming that if the file exists, it will stay locked for a long time, so we just try to create a file with the next name.
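The "move on to the next name" approach described here could look roughly like the sketch below. It is only an illustration under assumptions: firstAvailable and candidateName are hypothetical names, the naming scheme is invented, and plain Either is used in place of the ExceptT-style throwE seen in the quoted diff.

```haskell
-- | Hypothetical naming scheme: "stderr" and 3 give "stderr-3.log".
candidateName :: String -> Int -> FilePath
candidateName base n = base <> "-" <> show n <> ".log"

-- | Try candidate indices 0, 1, 2, ... and return the first success.
-- A failed index is never tried again; the search moves to the next index
-- and returns the last error once @maxIndex@ has been reached.
firstAvailable :: Int -> (Int -> IO (Either e a)) -> IO (Either e a)
firstAvailable maxIndex tryIndex = go 0
  where
    go n = do
      result <- tryIndex n
      case result of
        Right a -> pure (Right a)
        Left e
          | n >= maxIndex -> pure (Left e)
          | otherwise -> go (n + 1)
```

Since each index is attempted exactly once, there is no pause between attempts, which is why a backoff policy has nothing to act on in this spot.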
Description
This PR fixes nondeterministic testnet startup issues:
- [...] TIME_WAIT state. Fixed by waiting and retrying.
- [...] stderr log file, resulting in node start retry failure. Fixed by additionally closing the file handle manually and using a different name for the log file.

Checklist
- [...] See Running tests for more details
- [...] CHANGELOG.md for affected package
- [...] .cabal files are updated
- [...] hlint. See .github/workflows/check-hlint.yml to get the hlint version
- [...] stylish-haskell. See .github/workflows/stylish-haskell.yml to get the stylish-haskell version
- [...] ghc-8.10.7 and ghc-9.2.7

Note on CI
If your PR is from a fork, the necessary CI jobs won't trigger automatically for security reasons.
You will need to get someone with write privileges. Please contact IOG node developers to do this
for you.