Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metadata: Detailed Dataset Authorship Metadata #8875

Merged
merged 3 commits into from
Nov 13, 2024

Conversation

mofosyne
Copy link
Collaborator

@mofosyne mofosyne commented Aug 5, 2024

On second thought, the dataset might actually be more important than the models. So we may want to add detailed name, author, url, etc... for it as well.

@mofosyne mofosyne added the Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix label Aug 5, 2024
@github-actions github-actions bot added examples python python script changes labels Aug 5, 2024
@mofosyne mofosyne marked this pull request as draft August 5, 2024 15:26
@mofosyne
Copy link
Collaborator Author

mofosyne commented Aug 5, 2024

set to draft as it's just a concept I came up with and might need a second opinion if this PR makes sense (Plus likely to fail typechecks anyway for my first stab)

@mofosyne mofosyne marked this pull request as ready for review August 5, 2024 21:27
@mofosyne mofosyne requested a review from compilade August 7, 2024 12:26
@mofosyne
Copy link
Collaborator Author

mofosyne commented Aug 7, 2024

Added some extra tests for extra confidence with this feature.

Also if we are doing this, we might as well go all the way and let people fully enter in the authorship metadata in base model and datasets in model card.

Now happy enough with this PR for a review. Will also need to update the gguf spec doc if this passes

@mofosyne
Copy link
Collaborator Author

@compilade if you got some time, have a look at this as I think this approach will at least be more intuitive for model card writers.

@mofosyne mofosyne changed the title py: Detailed Dataset Authorship Metadata metadata: Detailed Dataset Authorship Metadata Aug 10, 2024
@mofosyne mofosyne added Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level and removed Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix labels Aug 13, 2024
@mofosyne mofosyne requested a review from Galunid August 17, 2024 01:39
Copy link
Collaborator

@compilade compilade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this will coexist well with how HuggingFace processes the model cards, see https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/repocard_data.py.

Other than that, the implementation seems sound.

It seems like the only breaking change is the removal of Keys.General.DATASETS, which seems reasonable.

gguf-py/gguf/metadata.py Outdated Show resolved Hide resolved
gguf-py/tests/test_metadata.py Outdated Show resolved Hide resolved
gguf-py/gguf/metadata.py Show resolved Hide resolved
@mofosyne
Copy link
Collaborator Author

mofosyne commented Aug 28, 2024

@compilade looks like the HF team would have to do lots of refactoring to support both dicts and lists to support this concept fully. So at least according to @Wauplin it is unlikely they will be able to support more complex metadata that we are proposing in the near term (He doesn't object to the concept however, but the organisation priority for this feature would understandably be low).

How would you prefer to approach this? I'm still proposing the refactoring of datasets in the main GGUF standard... so we can either separate just the metadata refactoring of datasets... or we can just merge as is... but note in documentation a warning that dict in modelcard and parent model is not supported in HuggingFace model card parser (Which would give HF time to support it eventually).

@mofosyne
Copy link
Collaborator Author

mofosyne commented Aug 28, 2024

@compilade Okay, decided to refactor to move the 'dicts' in modelcard feature into a separate PR.

Kept the 'url' in string as I don't think it's as much of a leap for the HF team to adopt if they are okay with it... plus it's already in string format so won't break at least the string typecheck on their side.

@mofosyne mofosyne marked this pull request as draft September 8, 2024 15:29
@mofosyne
Copy link
Collaborator Author

mofosyne commented Sep 8, 2024

Setting to draft until I have an approach to address huggingface/huggingface_hub#2479 (comment) . Ergo their recommendation is to shift the more advance dict usage into a separate model card metadata property. Will need to at least think up of a different name

This is to address "Model Card: Allow for dicts in datasets and base_model and also update spec" in huggingface/huggingface_hub#2479 where we would like to add detailed metadata support for both base model and datashet but in a way that huggingface will eventually be able to support (They are currently using either a string or string list... we will be using a list of dict which would be extensible). They recommended creating a seperate metadata property for this.
@mofosyne mofosyne force-pushed the feature-metadata-detailed-dataset branch from 233f89b to 9a46519 Compare October 7, 2024 11:56
@mofosyne
Copy link
Collaborator Author

mofosyne commented Oct 7, 2024

@compilade @Galunid after a while, I think I shall suggest that users would canonically use base_model_sources and dataset_sources if they want to include detailed metadata for parent base model and datasets.

This is picked to avoid conflict with Hugging Face Hub's existing definition for base_model and datasets which they said is restricted to just str and str list. For them to more easily integrate our suggestion, they request we pick a different metadata property name.

Took some time to think it over. I think the below makes most sense.

  • base_model_sources (List[dict], optional)
  • dataset_sources (List[dict], optional)

(Once merged... I'll update the wiki to include the new suggested model card approach we prefer)

@mofosyne mofosyne marked this pull request as ready for review October 7, 2024 12:05
@mofosyne mofosyne requested a review from Galunid October 7, 2024 12:05
@mofosyne mofosyne requested a review from compilade October 7, 2024 12:05
@mofosyne mofosyne requested a review from slaren October 21, 2024 10:13
@mofosyne
Copy link
Collaborator Author

mofosyne commented Nov 12, 2024

@ggerganov can you review this PR? It's been though a few revision and both @compilade and @lin72h gave the thumbs up at least (but I'm guessing they would like someone more experienced to put the final stamp of approval).

FYI, I had a chat with the Hugging Face team in this issue ticket and they suggested using a different property name for the detailed authorship (so it doesn't conflict with their implementation which assumes string values). However they expressed interest in integrating our approach if its adopted more widely. This correction/consideration is already integrated into this PR.

I've looked thought a few other PRs I've created (and closed a few), but I think this one is still relevant.

Copy link
Owner

@ggerganov ggerganov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems ok to merge. I'm not familiar with datasets usage, but as long as it does not interfere with normal model usage, it should be ok.

@mofosyne mofosyne merged commit a0ec17b into ggerganov:master Nov 13, 2024
8 checks passed
@mofosyne
Copy link
Collaborator Author

Appreciated. Had a quick lookover again, regarding any potential interference to normal model usage... it will read normal model metadata in huggingface in a backwards compatible way. Since we can deal with both list of string and list of dict. It's only HuggingFace's code that is a bit inflexible, hence their request to have a separate field for detailed metadata on model and dataset.

@mofosyne mofosyne deleted the feature-metadata-detailed-dataset branch November 13, 2024 10:13
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
Converter script can now read these two fields as a detailed base model and dataset source.
This was done so that it will be easier for Hugging Face to integrate detailed metadata as needed.

 -  base_model_sources (List[dict], optional)
 -  dataset_sources (List[dict], optional)

Dataset now represented as:

 - general.dataset.count
 - general.dataset.{id}.name
 - general.dataset.{id}.author
 - general.dataset.{id}.version
 - general.dataset.{id}.organization
 - general.dataset.{id}.description
 - general.dataset.{id}.url
 - general.dataset.{id}.doi
 - general.dataset.{id}.uuid
 - general.dataset.{id}.repo_url

This also adds to base model these metadata:

 - general.base_model.{id}.description
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 17, 2024
Converter script can now read these two fields as a detailed base model and dataset source.
This was done so that it will be easier for Hugging Face to integrate detailed metadata as needed.

 -  base_model_sources (List[dict], optional)
 -  dataset_sources (List[dict], optional)

Dataset now represented as:

 - general.dataset.count
 - general.dataset.{id}.name
 - general.dataset.{id}.author
 - general.dataset.{id}.version
 - general.dataset.{id}.organization
 - general.dataset.{id}.description
 - general.dataset.{id}.url
 - general.dataset.{id}.doi
 - general.dataset.{id}.uuid
 - general.dataset.{id}.repo_url

This also adds to base model these metadata:

 - general.base_model.{id}.description
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Converter script can now read these two fields as a detailed base model and dataset source.
This was done so that it will be easier for Hugging Face to integrate detailed metadata as needed.

 -  base_model_sources (List[dict], optional)
 -  dataset_sources (List[dict], optional)

Dataset now represented as:

 - general.dataset.count
 - general.dataset.{id}.name
 - general.dataset.{id}.author
 - general.dataset.{id}.version
 - general.dataset.{id}.organization
 - general.dataset.{id}.description
 - general.dataset.{id}.url
 - general.dataset.{id}.doi
 - general.dataset.{id}.uuid
 - general.dataset.{id}.repo_url

This also adds to base model these metadata:

 - general.base_model.{id}.description
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
examples python python script changes Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants