Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle json decode error in msvc cache #4402

Merged
merged 4 commits into from
Sep 4, 2023

Conversation

mwichmann
Copy link
Collaborator

@mwichmann mwichmann commented Aug 26, 2023

A bit of a revamp of read_script_env_cache and write_script_env_cache. The significant change is that on read, a JSDONDecodeError is detected, and the cache file removed if so - we're guessing a write race corrupted the file.

If this works, will update with more "PR polish", else will withdraw. The "guess" is because only the AppVeyor build on Py 3.6/VS 2017 hits this, haven't reproduced in a developer context yet.

There's no external visibility, so no doc impacts.

Contributor Checklist:

  • I have created a new test or updated the unit tests to cover the new/changed functionality.
  • I have updated CHANGES.txt (and read the README.rst)
  • I have updated the appropriate documentation

@mwichmann mwichmann added the MSVC Microsoft Visual C++ Support label Aug 26, 2023
@mwichmann
Copy link
Collaborator Author

The 3.6 build passed.

@bdbaddog
Copy link
Contributor

Looks good to me. PR polish it up and we'll get it merged.

@mwichmann
Copy link
Collaborator Author

aaand.... the vs2019/py3.8 version failed...

@mwichmann
Copy link
Collaborator Author

Perhaps a one-off...

[00:12:32] 548/1276 (42.95%) C:\Python38\python.exe test\Interactive\cache-force.py
[00:12:32] Traceback (most recent call last):
[00:12:32]   File "test\Interactive\cache-force.py", line 77, in <module>
[00:12:32]     shutil.rmtree(test.workpath('cache'))
[00:12:32]   File "C:\Python38\lib\shutil.py", line 740, in rmtree
[00:12:32]     return _rmtree_unsafe(path, onerror)
[00:12:32]   File "C:\Python38\lib\shutil.py", line 613, in _rmtree_unsafe
[00:12:32]     _rmtree_unsafe(fullname, onerror)
[00:12:32]   File "C:\Python38\lib\shutil.py", line 618, in _rmtree_unsafe
[00:12:32]     onerror(os.unlink, fullname, sys.exc_info()) 
[00:12:32]   File "C:\Python38\lib\shutil.py", line 616, in _rmtree_unsafe 
[00:12:32]     os.unlink(fullname)
[00:12:32] PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\appveyor\\AppData\\Local\\Temp\\1\\testcmd.3156.7qs4b5ta\\cache\\43\\439f39656a672b443377ce1f8f10f2ac.tmp745b176afa424b60abd41d46c8b225fb'

@mwichmann
Copy link
Collaborator Author

aaand.... the vs2019/py3.8 version failed...

worked on the rebuild after the PR (non-code) update...

@mwichmann
Copy link
Collaborator Author

mwichmann commented Aug 27, 2023

Okay, the latest build caught a conflict on camera:

[00:20:08] 671/1276 (52.59%) C:\Python311\python.exe test\MSVC\MSVC_TOOLSET_VERSION.py
[00:20:08] C:\projects\scons\scripts\scons.py returned 2
[00:20:08] STDOUT =========================================================================
[00:20:08] 
[00:20:08] STDERR =========================================================================
[00:20:08] PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\appveyor\\scons_msvc_cache.json':
[00:20:08]   File "C:\Users\appveyor\AppData\Local\Temp\1\testcmd.6564.htzu3wg5\SConstruct", line 3:
[00:20:08]     env = Environment(MSVC_VERSION='14.3', MSVC_TOOLSET_VERSION='14.30.17.0', tools=['msvc'])   

I think this means we are getting races on the cachefile. I wonder if we need to be more polite on writes... like elsewhere in the codebase, write it to a temporary file, then rename it over if it succeeded. Not quite sure the implications - dueling changes to the contents? Or maybe try some kind of locking protocol? Hate to get too fancy, this was supposed to be a quick hack to let tests run better, now it's showing hydra behavior...

@mwichmann
Copy link
Collaborator Author

mention @jcbrill so he's aware of this pending stuff too...

@jcbrill
Copy link
Contributor

jcbrill commented Aug 27, 2023

@mwichmann Thanks.

I was looking at the changes earlier today. Looks good to me.

I periodically experience the file access errors. Typically happens when I forget to disable windows defender real-time scanning in a Windows VMWare virtual machine. I believe the disabled state resets to enabled when the virtual machine is rebooted.

@mwichmann mwichmann changed the title Handle json decode error in mavc cache Handle json decode error in msvc cache Aug 27, 2023
@mwichmann
Copy link
Collaborator Author

The previous note about a failure would seem to be related to #4268

@jcbrill
Copy link
Contributor

jcbrill commented Aug 27, 2023

The previous note about a failure would seem to be related to #4268

@mwichmann There is definitely a race condition (and likely cache information loss) as the cache is read-once, write-multiple times and unprotected from multiple-process access.

I have a "reference implementation" outside of SCons that uses a Windows named mutex to serialize access to the cache which relies on correctly "normalizing" the real path to the cache file.

The repository for the code is still private. I would be happy to add you and @bdbaddog as a contributor in the short term to be able to view/review the implementation.

@mwichmann
Copy link
Collaborator Author

that would be cool. I've got a simple file locker sitting around which might also be of use; it tries to be portable so doesn't use any specific tricks (just tries to get exclusive access to a lockfile), but I think we only care about Windows, so someting WIndows-specific might be fine. Or do we care about the Mac case?

@jcbrill
Copy link
Contributor

jcbrill commented Aug 27, 2023

An invite should have been sent.

Can be extraordinarily inefficient when multiple processes write many entries at the same time due to loss protection. Tested earlier with 4 python interpreters building all possible combinations (i.e., long-running and writing many entries). This is worst-case scenario as they are basically running in lock-step and possibly running the same cache entry build at the same time.

Files of interest:

  • mswindev/winsync.py (wrapper for windows mutex)
  • mswindev/winapi.py (windows api calls via ctypes)
  • mswindev/vsdetect.py (cache implementation starting around line 12,573)

Edit:
Feel free to send any questions to my personal email.

A bit of a revamp of read_script_env_cache and write_script_env_cache.
The significant change is that on read, a JSDONDecodeError is
detected, and the cache file removed if so - we're guessing a
write race corrupted the file.

If this works, will update with more "PR polish", else will withdraw.

Signed-off-by: Mats Wichmann <[email protected]>
@jcbrill
Copy link
Contributor

jcbrill commented Aug 28, 2023

@mwichmann I'm not sure the warning messages are going to populate correctly. It appears like a mix of % formatting with f-strings.

Since the same format string is used in both the warning message and the debug statement, probably should be %-formatting with the explicit population (e.g.,fmt % (value,...)) and removal of the f-string declaration.

As always, I could be wrong.

@mwichmann
Copy link
Collaborator Author

ah, forgot it's used twice. for debug, it's logging and so you're supposed to use (template, args...). I'll take a look later, deep in something at the moment.

@mwichmann
Copy link
Collaborator Author

Yes, I got caught between changes - did something, then changed it again and didn't clean up. Thanks for spotting.

@bdbaddog bdbaddog merged commit cbd156d into SCons:master Sep 4, 2023
@mwichmann mwichmann added this to the 4.6 milestone Sep 4, 2023
@mwichmann mwichmann deleted the maint/cache-decode-err branch September 4, 2023 14:01
@mwichmann mwichmann mentioned this pull request Sep 4, 2023
3 tasks
mwichmann added a commit to mwichmann/scons that referenced this pull request Sep 4, 2023
Following on from SCons#4402, which tries to recover from races that might
either cause a corrupt msvc cache file, or an incomplete read happening
while a write is still in process, add a simple-minded locking protocol to
try to prevent the problem in the first place.  For writing, a lockfile
is created with exclusive access. For reading, the same is done but
immediately released: we only wanted to wait that it's not currently
locked, we don't need to keep it locked at this point.

This is addressing what's mainly an issue for testing, when there can
be many concurrent SCons instance each running a test - we don't think
this is likely in normal developer usage.

Signed-off-by: Mats Wichmann <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
MSVC Microsoft Visual C++ Support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants