Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

huggingface-cli upload - Validate README.md before file hashing #2452

Merged
merged 1 commit into from
Aug 16, 2024

Conversation

hlky
Copy link
Contributor

@hlky hlky commented Aug 15, 2024

  • README.md validation is moved from create_commit to _prepare_upload_folder_additions.
  • _prepare_upload_folder_additions is moved into HfApi class, this provides access to _build_hf_headers.
  • repo_type and token are added to _prepare_upload_folder_additions's parameters, these are needed for the request to /api/validate-yaml endpoint.
  • Result from filter_repo_objects stored in filtered_repo_objects (converted to a list, as it's a generator) to avoid recalculation as it is now used to find README.md then later to return the CommitOperationAdds.
  • If found, a temporary CommitOperationAdd is created for README.md to provide as_file.
  • break added after README.md validation to avoid iterating other files.

Closes #2451

Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @hlky, thanks for the PR and detailed description. I've left a couple comments to address before merging. The two main things are:

  • I'd prefer to separate the "yaml validation" from the "list and filter files" logic
  • we have to validate the yaml in create_commit in any case (not a problem to do it twice if upload_folder has been called)

src/huggingface_hub/hf_api.py Outdated Show resolved Hide resolved
Comment on lines 9200 to 9210
for relpath in filtered_repo_objects:
if relpath == "README.md":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then, instead of the for ... in ...: if ... == "README.md": .... ... ... break logic, could you use relpath_to_abspath instead like this:

if "README.md" in filtered_repo_objects:
    self._validate_yaml(
        content=relpath_to_abspath["README.md"].read_text(),
        repo_type=repo_type,
        token=token,
    )

This way, it's not even needed to build the CommitOperationAdd just to read the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, much better to use read_text from pathlib and avoid the for loop.

src/huggingface_hub/hf_api.py Outdated Show resolved Hide resolved
@hlky hlky force-pushed the validate-before-file-hash branch 2 times, most recently from 397f9d4 to dfdf1e5 Compare August 16, 2024 15:07
Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the prompt changes @hlky! Except for the docstring, I think we are close to merge this PR :) I'll also trigger the CI to check nothing broke.

src/huggingface_hub/hf_api.py Show resolved Hide resolved
headers = self._build_hf_headers(token=token)

response = get_session().post(
f"{ENDPOINT}/api/validate-yaml",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
f"{ENDPOINT}/api/validate-yaml",
f"{self.endpoint}/api/validate-yaml",

I just realized we must use self.endpoint (configurable by user) rather than ENDPOINT here! It was a mistake in existing implementation but better to fix it as part of this PR :)

Comment on lines 9224 to 9245
def _validate_yaml(self, content: str, *, repo_type: Optional[str] = None, token: Union[bool, str, None] = None):
headers = self._build_hf_headers(token=token)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a small docstring would be great yes. Thanks in advance!

@hlky hlky force-pushed the validate-before-file-hash branch from dfdf1e5 to 21af425 Compare August 16, 2024 15:19
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@hlky hlky force-pushed the validate-before-file-hash branch from 21af425 to 84921cf Compare August 16, 2024 15:21
Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the clear docstring! Looks ready to be merged! I'll just wait for the CI to pass before doing so :)

@hlky
Copy link
Contributor Author

hlky commented Aug 16, 2024

The only other thing I can think of is that there will be duplicate calls to the endpoint with upload_folder path, once during _prepare_upload_folder_additions then again during create_commit, if the extra load to the endpoint will be an issue there could be something like a boolean self.yaml_validated to skip the validation in create_commit.

@Wauplin
Copy link
Contributor

Wauplin commented Aug 16, 2024

Yeah, I thought the same as well - thanks for raising the question. Honestly, I don't think the extra overhead is really problematic. We could also have a small @functools.cache decorator on _validate_yaml (though it would be nice to cache only based on content+repo_type, not the token)

@hlky
Copy link
Contributor Author

hlky commented Aug 16, 2024

The fastai integration test failure appears to be because repo_type is None. It hasn't triggered before repo_type is set to default value in create_commit. I'd suggest we duplicate that to upload_folder.

import requests

content = '---\ntags:\n- fastai\n---\n\n# Amazing!\n\n🥳 Congratulations on hosting your fastai model on the Hugging Face Hub!\n\...n## Intended uses & limitations\nMore information needed\n\n## Training and evaluation data\nMore information needed\n'
url = "https://hub-ci.huggingface.co/api/validate-yaml"
r = requests.post(url, {"content": content, "repoType": None}) # {'error': '"value" does not match any of the allowed types'}

r = requests.post(url, {"content": content, "repoType": "model"}) # {'errors': [], 'warnings': []}

@hlky hlky force-pushed the validate-before-file-hash branch from 84921cf to 46d98b9 Compare August 16, 2024 16:06
@Wauplin
Copy link
Contributor

Wauplin commented Aug 16, 2024

Thanks for looking into it. Yes, I would simply duplicate this line in def _validate_yaml as you mentioned.

repo_type = repo_type if repo_type is not None else REPO_TYPE_MODEL

@@ -4849,6 +4833,7 @@ def upload_folder(

```
"""
repo_type = repo_type if repo_type is not None else REPO_TYPE_MODEL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move this to _validate_yaml directly to prevent errors in the future in case we use it somewhere else

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I've moved it to _validate_yaml.

@hlky hlky force-pushed the validate-before-file-hash branch from 46d98b9 to 641b393 Compare August 16, 2024 16:11
@Wauplin
Copy link
Contributor

Wauplin commented Aug 16, 2024

Perfect! Thanks for the iterations on this @hlky! Failing tests are unrelated to this PR so we are good to merge now! 🎉

@Wauplin Wauplin merged commit e9cd695 into huggingface:main Aug 16, 2024
12 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

huggingface-cli upload - Validate README.md before file hashing
3 participants