-
Notifications
You must be signed in to change notification settings - Fork 10.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
metadata: Detailed Dataset Authorship Metadata #8875
metadata: Detailed Dataset Authorship Metadata #8875
Conversation
set to draft as it's just a concept I came up with and might need a second opinion if this PR makes sense (Plus likely to fail typechecks anyway for my first stab) |
Added some extra tests for extra confidence with this feature. Also if we are doing this, we might as well go all the way and let people fully enter in the authorship metadata in base model and datasets in model card. Now happy enough with this PR for a review. Will also need to update the gguf spec doc if this passes |
@compilade if you got some time, have a look at this as I think this approach will at least be more intuitive for model card writers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this will coexist well with how HuggingFace processes the model cards, see https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/repocard_data.py.
Other than that, the implementation seems sound.
It seems like the only breaking change is the removal of Keys.General.DATASETS
, which seems reasonable.
@compilade looks like the HF team would have to do lots of refactoring to support both dicts and lists to support this concept fully. So at least according to @Wauplin it is unlikely they will be able to support more complex metadata that we are proposing in the near term (He doesn't object to the concept however, but the organisation priority for this feature would understandably be low). How would you prefer to approach this? I'm still proposing the refactoring of datasets in the main GGUF standard... so we can either separate just the metadata refactoring of datasets... or we can just merge as is... but note in documentation a warning that dict in modelcard and parent model is not supported in HuggingFace model card parser (Which would give HF time to support it eventually). |
b8d0e03
to
d32c74d
Compare
@compilade Okay, decided to refactor to move the 'dicts' in modelcard feature into a separate PR. Kept the 'url' in string as I don't think it's as much of a leap for the HF team to adopt if they are okay with it... plus it's already in string format so won't break at least the string typecheck on their side. |
Setting to draft until I have an approach to address huggingface/huggingface_hub#2479 (comment) . Ergo their recommendation is to shift the more advance dict usage into a separate model card metadata property. Will need to at least think up of a different name |
This is to address "Model Card: Allow for dicts in datasets and base_model and also update spec" in huggingface/huggingface_hub#2479 where we would like to add detailed metadata support for both base model and datashet but in a way that huggingface will eventually be able to support (They are currently using either a string or string list... we will be using a list of dict which would be extensible). They recommended creating a seperate metadata property for this.
233f89b
to
9a46519
Compare
@compilade @Galunid after a while, I think I shall suggest that users would canonically use This is picked to avoid conflict with Hugging Face Hub's existing definition for base_model and datasets which they said is restricted to just str and str list. For them to more easily integrate our suggestion, they request we pick a different metadata property name. Took some time to think it over. I think the below makes most sense.
(Once merged... I'll update the wiki to include the new suggested model card approach we prefer) |
@ggerganov can you review this PR? It's been though a few revision and both @compilade and @lin72h gave the thumbs up at least (but I'm guessing they would like someone more experienced to put the final stamp of approval). FYI, I had a chat with the Hugging Face team in this issue ticket and they suggested using a different property name for the detailed authorship (so it doesn't conflict with their implementation which assumes string values). However they expressed interest in integrating our approach if its adopted more widely. This correction/consideration is already integrated into this PR. I've looked thought a few other PRs I've created (and closed a few), but I think this one is still relevant. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems ok to merge. I'm not familiar with datasets usage, but as long as it does not interfere with normal model usage, it should be ok.
Appreciated. Had a quick lookover again, regarding any potential interference to normal model usage... it will read normal model metadata in huggingface in a backwards compatible way. Since we can deal with both list of string and list of dict. It's only HuggingFace's code that is a bit inflexible, hence their request to have a separate field for detailed metadata on model and dataset. |
Converter script can now read these two fields as a detailed base model and dataset source. This was done so that it will be easier for Hugging Face to integrate detailed metadata as needed. - base_model_sources (List[dict], optional) - dataset_sources (List[dict], optional) Dataset now represented as: - general.dataset.count - general.dataset.{id}.name - general.dataset.{id}.author - general.dataset.{id}.version - general.dataset.{id}.organization - general.dataset.{id}.description - general.dataset.{id}.url - general.dataset.{id}.doi - general.dataset.{id}.uuid - general.dataset.{id}.repo_url This also adds to base model these metadata: - general.base_model.{id}.description
Converter script can now read these two fields as a detailed base model and dataset source. This was done so that it will be easier for Hugging Face to integrate detailed metadata as needed. - base_model_sources (List[dict], optional) - dataset_sources (List[dict], optional) Dataset now represented as: - general.dataset.count - general.dataset.{id}.name - general.dataset.{id}.author - general.dataset.{id}.version - general.dataset.{id}.organization - general.dataset.{id}.description - general.dataset.{id}.url - general.dataset.{id}.doi - general.dataset.{id}.uuid - general.dataset.{id}.repo_url This also adds to base model these metadata: - general.base_model.{id}.description
Converter script can now read these two fields as a detailed base model and dataset source. This was done so that it will be easier for Hugging Face to integrate detailed metadata as needed. - base_model_sources (List[dict], optional) - dataset_sources (List[dict], optional) Dataset now represented as: - general.dataset.count - general.dataset.{id}.name - general.dataset.{id}.author - general.dataset.{id}.version - general.dataset.{id}.organization - general.dataset.{id}.description - general.dataset.{id}.url - general.dataset.{id}.doi - general.dataset.{id}.uuid - general.dataset.{id}.repo_url This also adds to base model these metadata: - general.base_model.{id}.description
On second thought, the dataset might actually be more important than the models. So we may want to add detailed name, author, url, etc... for it as well.