Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try to fix reset-project Playwright tests #789

Merged
merged 33 commits into from
May 30, 2024

Conversation

rmunn
Copy link
Contributor

@rmunn rmunn commented May 7, 2024

We believe the failures in the Playwright tests are due to NFS caching the directory entry, and thus not noticing that the directory has new contents, resulting in Mercurial thinking the directory is still empty for some time after the zip file upload has completed. It is possible to force NFS to refresh its cache of a directory by doing opendir() and closedir() on that directory. I don't know if C#'s Directory.GetFiles method, which uses the openat() system call rather than opendir() and closedir(), will have the same effect — but it's worth a try.

Update: The other piece of the puzzle appears to be caching on the side of the LexBox API pod's NFS client. NFS writes are cached until one of the following events occurs (scroll down to "The sync mount option" heading):

  • Memory pressure forces reclamation of system memory resources.
  • An application flushes file data explicitly with sync(2), msync(2), or fsync(3).
  • An application closes a file with close(2).
  • The file is locked/unlocked via fcntl(2).

So doing an opendir/closedir (which actually needs to happen on the hgweb pod, not the lexbox pod) is not enough; we also need to force the lexbox pod's NFS client to send its data to the NFS server. I've implemented that by doing the equivalent of mkdir random-name; rmdir random-name inside the Mercurial repo after the LexBox API pod touches it (e.g. a project reset or FinishReset operation), followed by a call to ls on the hgweb side. So far that seems to be working; more experimentation is needed to see whether the ls is necessary, or whether the mkdir/rmdir is enough. (Or perhaps it truly does need to be the C# equivalent of cp /dev/null random-file; rm random-file so that a close() system call is issued).

Might fix #765.

Copy link

github-actions bot commented May 7, 2024

C# Unit Tests

57 tests  ±0   57 ✅ +1   10s ⏱️ ±0s
11 suites ±0    0 💤 ±0 
 2 files   ±0    0 ❌  - 1 

Results for commit 6c2b514. ± Comparison against base commit 3c62210.

♻️ This comment has been updated with latest results.

@rmunn
Copy link
Contributor Author

rmunn commented May 7, 2024

I'm deploying the develop branch (without this PR) to the develop server now. Then I'll run the Playwright tests against it, and verify that they fail. Then I'll deploy this PR branch to the develop server and see if it makes the Playwright tests pass.

Copy link

github-actions bot commented May 7, 2024

UI unit Tests

11 tests   11 ✅  0s ⏱️
 3 suites   0 💤
 1 files     0 ❌

Results for commit 6c2b514.

♻️ This comment has been updated with latest results.

@rmunn
Copy link
Contributor Author

rmunn commented May 7, 2024

Can't deploy the develop branch to the develop server? Huh? It's very strange to me that the "Build hgweb / deploy hgweb / deploy" step failed actions/checkout@v4 can't find the default branch of the repo it was checking out. Perhaps we should change that repo's default branch name back to master instead of main and see what happens.

backend/LexBoxApi/Services/HgService.cs Outdated Show resolved Hide resolved
@rmunn rmunn force-pushed the chore/try-to-fix-playwright-reset-tests branch from f6165f7 to d3e550d Compare May 8, 2024 03:11
@rmunn
Copy link
Contributor Author

rmunn commented May 8, 2024

Was able to find proof that running ls from the lexbox pod after a project reset does not fix the 404s, and I think I know why. I just realized that the NFS directory cache is on the NFS client (in our case, the hgweb pod), not the NFS server (which would be the Rancher storage-manager container). Each NFS client (each pod that mounts a volume) has its own directory cache; that's kind of the whole point of NFS caching, to avoid unnecessary server round-trips. So to avoid 404s coming from hgweb, we need to do the ls on hgweb.

I'm going to remove the Process.Start("ls") and/or Directory.GetFiles() calls from the C# backend, and insert an ls line into the command runner. Then I'll continue testing, to prove whether that solves the problem for good.

@hahn-kev
Copy link
Collaborator

hahn-kev commented May 8, 2024

Was able to find proof that running ls from the lexbox pod after a project reset does not fix the 404s, and I think I know why. I just realized that the NFS directory cache is on the NFS client (in our case, the hgweb pod), not the NFS server (which would be the Rancher storage-manager container). Each NFS client (each pod that mounts a volume) has its own directory cache; that's kind of the whole point of NFS caching, to avoid unnecessary server round-trips. So to avoid 404s coming from hgweb, we need to do the ls on hgweb.

I'm going to remove the Process.Start("ls") and/or Directory.GetFiles() calls from the C# backend, and insert an ls line into the command runner. Then I'll continue testing, to prove whether that solves the problem for good.

yeah that's what I said here: #765 (comment)

@rmunn
Copy link
Contributor Author

rmunn commented May 8, 2024

yeah that's what I said here: #765 (comment)

Yep. I thought about that last week but wanted to verify whether changes to the C# code were enough. Proved that they weren't. However, I've also managed to prove that running ls on hgweb will fix the project-reset tests. So next I'll work out whether there are any situations where hgweb is the one modifying the repo directories (in ways that would need invalidating cache, i.e. just receiving an hg push wouldn't matter).

BTW, I ran the projectReset.test.ts tests about 30 times, using ls on hgweb, before finally hitting an error -- and it was an HTTP 502 Bad Gateway caused by a network hiccup. So I'm satisfied that commit c87ec67 is fixing the issue.

@rmunn
Copy link
Contributor Author

rmunn commented May 8, 2024

Commit 5cc036e contains the change I had made locally, but not yet pushed, when I ran the reset-project E2E test 30 times without a failure. (It only works because of commit c87ec67, which makes the ls run whenever you run anything in command-runner.sh).

I think I'm going to push one more commit, adding a new command to the command runner called "refreshdircache". It will not call any hg commands, so its entire purpose is to run that ls and then do nothing. (That's mildly faster than running the tip command and throwing away the result, because we don't have to start an hg process). Once that's done, this will be ready for review.

@rmunn
Copy link
Contributor Author

rmunn commented May 8, 2024

Marking as draft PR for now because there are more changes I plan for tomorrow, so there's not much point in reviewing it in its current state.

@rmunn rmunn marked this pull request as draft May 8, 2024 09:22
@rmunn rmunn force-pushed the chore/try-to-fix-playwright-reset-tests branch from 9691a40 to 87582d0 Compare May 9, 2024 01:38
Pull in changes to deleting draft projects, necessary in order to run
Playwright E2E tests more than once
If FinishReset gets an upload of an *empty* Mercurial repo, then
WaitForRepoEmptyState could wait forever. To catch that situation 99% of
the time, we can simply check for the existence of `00changelog.i` in
the Mercurial repo store: that file is guaranteed to exist if the repo
has commits, so its absence means an empty repo.

To ensure belt-and-suspenders, we still need to implement a timeout in
the WaitForRepoEmptyState method, but this is a good start.
backend/LexBoxApi/Services/HgService.cs Dismissed Show resolved Hide resolved
We now choose a random filename and check its existence before creating
it. In theory there's still a chance that another process could create
the file before we get a chance to, but in practice that's never going
to happen.

We also create a directory rather than a file because on the very, very
slim chance that another process creates a file between when we check
its existence and when we create it, the Directory.Create call will
fail. So it's minutely safer.
@rmunn
Copy link
Contributor Author

rmunn commented May 28, 2024

I replaced the call to sync with creating, and then deleting, an empty directory inside the project repo. I could have used a file, but a directory is slightly safer (in filename-conflict situations that approximately never come up anyway, and by "never" I mean the chances of the server getting hit by a meteor are greater than the chance of the filename conflicting). This seems to work: I'm getting consistent test passes, and the tests aren't taking too long.

I'm cautiously optimistic, and going to take this PR off of draft status because I think I may have finally solved the problem.

@rmunn rmunn marked this pull request as ready for review May 28, 2024 09:50
@rmunn rmunn requested review from hahn-kev and myieye May 28, 2024 09:50
@rmunn
Copy link
Contributor Author

rmunn commented May 28, 2024

NOTE: I've marked this as ready for review, so I don't want to push experimental commits while others might be reviewing this. Therefore, I've created the tmp/playwright-reset-test-experiments branch to experiment with various permutations of the fix in this branch. I'll be pushing to tmp/playwright-reset-test-experiments and deploying it to the develop server, but I plan to leave this branch alone for a while. So you can review it, knowing that it will be stable for a few hours, and won't be constantly getting pushes as I run further tests.

Copy link
Collaborator

@hahn-kev hahn-kev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, though there's some changes you left in that shouldn't have made it in I suspect

backend/LexBoxApi/Services/HgService.cs Outdated Show resolved Hide resolved
backend/Testing/LexCore/Services/HgServiceTests.cs Outdated Show resolved Hide resolved
hgweb/command-runner.sh Show resolved Hide resolved
rmunn added 3 commits May 29, 2024 10:51
This allows us to uncomment the tests we were skipping earlier, because
now we can mock HttpClient and not allow HgService to hit the network
during unit tests.
Especially in WaitForRepoEmptyState, we'll want to make sure the
function times out after a while so we can't end up locking up the
server due to some unforeseen error.
@rmunn rmunn requested a review from hahn-kev May 29, 2024 04:26
@rmunn
Copy link
Contributor Author

rmunn commented May 29, 2024

@hahn-kev - Addressed all your review comments. Let me know if there's anything else that needs changing.

Copy link
Collaborator

@hahn-kev hahn-kev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still some things to fix

backend/LexBoxApi/Services/HgService.cs Outdated Show resolved Hide resolved
backend/LexBoxApi/Services/HgService.cs Outdated Show resolved Hide resolved
backend/LexBoxApi/Services/HgService.cs Outdated Show resolved Hide resolved
rmunn added 2 commits May 30, 2024 10:51
I forgot that .NET code will throw OperationCanceledException if passed
a token that gets canceled. We don't want WaitForRepoEmptyState to throw
if its timeout is reached, we want it to return as if it was successful.
So we need to catch and suppress that exception.

Also passed the token into Task.Delay so that if the timeout is reached
during the delay, the delay will cancel immediately instead of running
to completion.
@rmunn rmunn requested a review from hahn-kev May 30, 2024 04:05
Copy link
Collaborator

@hahn-kev hahn-kev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work Robin!

@rmunn rmunn merged commit 20ffa08 into develop May 30, 2024
15 of 16 checks passed
@myieye myieye deleted the chore/try-to-fix-playwright-reset-tests branch January 14, 2025 11:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

After uploading a .zip file in project reset, LexEntryCount incorrecly remains 0
2 participants