Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Download check for cache directory #559

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

gerhardol
Copy link
Contributor

Add a marker .download file to validate the contents in cache directories. Previously only the existence of the directory was used, so if the download was aborted the cache directory had to be deleted manually if this occurred (with a likely cryptic error message). If the .download check file does not exist, the directory will be deleted and downloaded again.

It is also possible to check the contents with a checksum. If not matching, the directory will be deleted and downloaded again.

For Git repos the repos can be deleted if the status is not clean, a checksum is not relevant (but used in the tests).

Note: The important part of this PR is the .download file, that will cover the majority of the issues.

@kyle-verdant
Copy link

Does this handle submodules well? The failures I most often saw were not that the main directory was empty, but that some required submodule didn't exist - most often with aws-sdk-cpp (don't use aws-sdk-cpp with CPM, it's too big).

@gerhardol
Copy link
Contributor Author

Does this handle submodules well? The failures I most often saw were not that the main directory was empty, but that some required submodule didn't exist - most often with aws-sdk-cpp (don't use aws-sdk-cpp with CPM, it's too big).

This will currently make no change if the download is aborted. CMake FetchContents makes some retries, then the job fails. If CMake reports this correctly, then the marker file is not created.
I do not know if that is an issue really, have not got good statistics why the core problem occurs.

The change is for cache and reusing the downloaded data. If no marker file, the directory is incomplete and will be deleted and downloaded again.
I assume CMake detects failures correctly, this should be the core change.

If checksum is used, it could handle also failed checksums.
With Git you do not need to use a checksum. If you set CPM_CHECK_CACHE_CHECKSUM then the directory is removed if git-status fail.

@ScottBailey
Copy link
Contributor

@gerhardol This looks like something that was missing from CPM! Maybe it's jammed up because it fails the tests? IDK. You can fix that pretty easy, the instructions are in the last few lines of CONTRIBUTING.md.

LMK when you get that done and I'll try to get a review in (but I can only really do a big review on the weekend, I think). Then we can ping a maintainer to get this merged.

@gerhardol
Copy link
Contributor Author

@gerhardol This looks like something that was missing from CPM! Maybe it's jammed up because it fails the tests? IDK. You can fix that pretty easy, the instructions are in the last few lines of CONTRIBUTING.md.

LMK when you get that done and I'll try to get a review in (but I can only really do a big review on the weekend, I think). Then we can ping a maintainer to get this merged.

When I apply the style, there are many changes to lines I have not changed too.
I pushed a tmp branch for now, on vacation and cannot investigate this more right now.

@ScottBailey
Copy link
Contributor

When I apply the style, there are many changes to lines I have not changed too. I pushed a tmp branch for now, on vacation and cannot investigate this more right now.

Enjoy your vacation! This can wait. :-)

@gerhardol gerhardol force-pushed the feature/download-check branch 4 times, most recently from 1539d3e to 7d7113f Compare September 9, 2024 17:00
@gerhardol
Copy link
Contributor Author

Test fails due to the example test/unit/checksum_directory.sh is used in tests and MacOs has other arguments.
MacOs can be fixed, but Windows is harder, I would like to skip those in tests.
(Checksum is not essential)

@gerhardol gerhardol force-pushed the feature/download-check branch 3 times, most recently from bc31cc5 to b641804 Compare September 10, 2024 07:27
Add a marker .download file to validate the contents in cache directories.
Previously only the existence of the directory was used, so if the
download was aborted the cache directory had to be deleted manually if
this occurred (with a likely cryptic error message).
If the .download check file does not exist, the directory will be deleted
and downloaded again.

It is also possible to check the contents with a checksum.
If not matching, the directory will be deleted and downloaded again.

For Git repos the repos can be deleted if the status is not clean,
a checksum is not relevant (but used in the tests).
@gerhardol gerhardol force-pushed the feature/download-check branch from b641804 to 8fceb72 Compare September 10, 2024 07:31
@gerhardol
Copy link
Contributor Author

MacOs handled, Windows checksum example is not provided (it depends on if msys, cygwin or some native program is used, it is up to Win users to provide such an example).
CMake native to calculate the checksum is too slow, I expect users that really want the checksum to tweak the algorithm anyway.
The important part of this change is the .download file marker, to not handle aborted downloads.

@christopherbate
Copy link

Can this be merged?

@gerhardol
Copy link
Contributor Author

Can this be merged?

I cannot do much. But if there are reviews and usage stories, maintainers are inspired and more comfortable merging.

@TheLartians
Copy link
Member

Hey, thanks for the PR and sorry for chiming in so late. I've been very busy and now have some time during the holidays to catch up. I really like the idea of checking the integrity of the cache directory. However there are some interactions users would need to consider:

  • Certain OS features or extensions can change the contents of the cache directory, e.g. just looking at a folder on Mac will create a .DS_Store file inside it
  • Sometimes CMakeLists of projects generate files inside the source directory that mess with the checksum
  • There may be unexpected interactions with the PATCHES or other commands that could change cached contents without modifying the CPM hash
  • I'm not sure how well the shell script will play with Windows machines

While I also see this an important feature, I think we should take special care to think about how we want to implement it. An alternative approach I've been thinking about would be implementing something closer to how npm handles dependencies, where it keeps the original download zips or git repos in the cache and copies them into a dependency directory in the current project on configuration (e.g. the CMAKE_BINARY_DIR). That way we could validate the SHA hash of the zip or commit to ensure the integrity of the cache and then apply patches etc. per-project in an isolated copy without having to worry about poisoning the cache.

@gerhardol
Copy link
Contributor Author

There are really two parts of this PR:

  • The .download check to require that the download is complete. Without this, the cache is unusable.
  • The checksum check for the cache

The second part is optional. It will not always work as you say, but then do not enable it.
I can remove that part or stress this harder in the description.

For the cache location: I do not want to expand and store 10GB toolchain files for 100 different clones, too slow and too much disk.
That handling should be optional too.

@TheLartians
Copy link
Member

I see, thanks for clarifying! To make the discussion easier I would suggest splitting this PR into adding the .download metadata storage and then build on it to add the checksum.

For the cache location: I do not want to expand and store 10GB toolchain files for 100 different clones, too slow and too much disk.

I agree that it would be impractical for large dependencies, perhaps it would still make sense if we can configure it on a per-dependency basis. But I would like to have this discussion in another PR / issue.

@patstew
Copy link
Contributor

patstew commented Dec 23, 2024

Just an idea, it might be more robust to just download to a directory called /hash.tmp/, then when it is finished rename the directory to /hash/. That way you don't need to mess about with the extra .download file.

If you make the .tmp folder have a random name instead you can also potentially avoid the need to use the lock file too, multiple cmake instances can safely download stuff into the cache directory, and in the case where two happen to download the same thing at the same time only one will succeed at the rename.

@gerhardol
Copy link
Contributor Author

Just an idea, it might be more robust to just download to a directory called /hash.tmp/, then when it is finished rename the directory to /hash/. That way you don't need to mess about with the extra .download file.

If you make the .tmp folder have a random name instead you can also potentially avoid the need to use the lock file too, multiple cmake instances can safely download stuff into the cache directory, and in the case where two happen to download the same thing at the same time only one will succeed at the rename.

A separate download folder does not work too well, I have used it for a private cache handling and ACL can be setup so a directory cannot be renamed (also with a temp directory in the parallel to the target directory).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants