metadata: Detailed Dataset Authorship Metadata #8875

mofosyne · 2024-08-05T15:26:39Z

On second thought, the dataset might actually be more important than the models. So we may want to add detailed name, author, url, etc... for it as well.

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

mofosyne · 2024-08-05T15:27:46Z

set to draft as it's just a concept I came up with and might need a second opinion if this PR makes sense (Plus likely to fail typechecks anyway for my first stab)

mofosyne · 2024-08-07T12:28:06Z

Added some extra tests for extra confidence with this feature.

Also if we are doing this, we might as well go all the way and let people fully enter in the authorship metadata in base model and datasets in model card.

Now happy enough with this PR for a review. Will also need to update the gguf spec doc if this passes

mofosyne · 2024-08-10T10:40:57Z

@compilade if you got some time, have a look at this as I think this approach will at least be more intuitive for model card writers.

examples/convert_legacy_llama.py

gguf-py/gguf/metadata.py

compilade

Not sure if this will coexist well with how HuggingFace processes the model cards, see https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/repocard_data.py.

Other than that, the implementation seems sound.

It seems like the only breaking change is the removal of Keys.General.DATASETS, which seems reasonable.

gguf-py/gguf/metadata.py

gguf-py/tests/test_metadata.py

gguf-py/gguf/metadata.py

mofosyne · 2024-08-28T11:05:23Z

@compilade looks like the HF team would have to do lots of refactoring to support both dicts and lists to support this concept fully. So at least according to @Wauplin it is unlikely they will be able to support more complex metadata that we are proposing in the near term (He doesn't object to the concept however, but the organisation priority for this feature would understandably be low).

How would you prefer to approach this? I'm still proposing the refactoring of datasets in the main GGUF standard... so we can either separate just the metadata refactoring of datasets... or we can just merge as is... but note in documentation a warning that dict in modelcard and parent model is not supported in HuggingFace model card parser (Which would give HF time to support it eventually).

mofosyne · 2024-08-28T11:54:52Z

@compilade Okay, decided to refactor to move the 'dicts' in modelcard feature into a separate PR.

Kept the 'url' in string as I don't think it's as much of a leap for the HF team to adopt if they are okay with it... plus it's already in string format so won't break at least the string typecheck on their side.

mofosyne · 2024-09-08T15:30:47Z

Setting to draft until I have an approach to address huggingface/huggingface_hub#2479 (comment) . Ergo their recommendation is to shift the more advance dict usage into a separate model card metadata property. Will need to at least think up of a different name

This is to address "Model Card: Allow for dicts in datasets and base_model and also update spec" in huggingface/huggingface_hub#2479 where we would like to add detailed metadata support for both base model and datashet but in a way that huggingface will eventually be able to support (They are currently using either a string or string list... we will be using a list of dict which would be extensible). They recommended creating a seperate metadata property for this.

mofosyne · 2024-10-07T12:05:14Z

@compilade @Galunid after a while, I think I shall suggest that users would canonically use base_model_sources and dataset_sources if they want to include detailed metadata for parent base model and datasets.

This is picked to avoid conflict with Hugging Face Hub's existing definition for base_model and datasets which they said is restricted to just str and str list. For them to more easily integrate our suggestion, they request we pick a different metadata property name.

Took some time to think it over. I think the below makes most sense.

base_model_sources (List[dict], optional)
dataset_sources (List[dict], optional)

(Once merged... I'll update the wiki to include the new suggested model card approach we prefer)

mofosyne · 2024-11-12T12:36:27Z

@ggerganov can you review this PR? It's been though a few revision and both @compilade and @lin72h gave the thumbs up at least (but I'm guessing they would like someone more experienced to put the final stamp of approval).

FYI, I had a chat with the Hugging Face team in this issue ticket and they suggested using a different property name for the detailed authorship (so it doesn't conflict with their implementation which assumes string values). However they expressed interest in integrating our approach if its adopted more widely. This correction/consideration is already integrated into this PR.

I've looked thought a few other PRs I've created (and closed a few), but I think this one is still relevant.

ggerganov

Seems ok to merge. I'm not familiar with datasets usage, but as long as it does not interfere with normal model usage, it should be ok.

mofosyne · 2024-11-13T10:12:47Z

Appreciated. Had a quick lookover again, regarding any potential interference to normal model usage... it will read normal model metadata in huggingface in a backwards compatible way. Since we can deal with both list of string and list of dict. It's only HuggingFace's code that is a bit inflexible, hence their request to have a separate field for detailed metadata on model and dataset.

Converter script can now read these two fields as a detailed base model and dataset source. This was done so that it will be easier for Hugging Face to integrate detailed metadata as needed. - base_model_sources (List[dict], optional) - dataset_sources (List[dict], optional) Dataset now represented as: - general.dataset.count - general.dataset.{id}.name - general.dataset.{id}.author - general.dataset.{id}.version - general.dataset.{id}.organization - general.dataset.{id}.description - general.dataset.{id}.url - general.dataset.{id}.doi - general.dataset.{id}.uuid - general.dataset.{id}.repo_url This also adds to base model these metadata: - general.base_model.{id}.description

mofosyne added the Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix label Aug 5, 2024

github-actions bot added examples python python script changes labels Aug 5, 2024

mofosyne marked this pull request as draft August 5, 2024 15:26

mofosyne marked this pull request as ready for review August 5, 2024 21:27

mofosyne requested a review from compilade August 7, 2024 12:26

mofosyne changed the title ~~py: Detailed Dataset Authorship Metadata~~ metadata: Detailed Dataset Authorship Metadata Aug 10, 2024

mofosyne added Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level and removed Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix labels Aug 13, 2024

mofosyne requested a review from Galunid August 17, 2024 01:39

Galunid reviewed Aug 17, 2024

View reviewed changes

examples/convert_legacy_llama.py Show resolved Hide resolved

Galunid reviewed Aug 17, 2024

View reviewed changes

gguf-py/gguf/metadata.py Show resolved Hide resolved

compilade reviewed Aug 18, 2024

View reviewed changes

gguf-py/gguf/metadata.py Outdated Show resolved Hide resolved

gguf-py/tests/test_metadata.py Outdated Show resolved Hide resolved

gguf-py/gguf/metadata.py Show resolved Hide resolved

mofosyne mentioned this pull request Aug 22, 2024

Model Card: Allow for dicts in datasets and base_model and also update spec huggingface/huggingface_hub#2479

Open

py: Detailed datasets metadata in gguf kv store

d32c74d

mofosyne force-pushed the feature-metadata-detailed-dataset branch from b8d0e03 to d32c74d Compare August 28, 2024 11:37

mofosyne mentioned this pull request Aug 28, 2024

metadata: Allow model creators to use modelcard datasets and base_model as a list of dicts #9221

Closed

4 tasks

mofosyne marked this pull request as draft September 8, 2024 15:29

mofosyne added 2 commits October 7, 2024 22:47

py: let users add full base model and dataset to model_card

6400391

mofosyne force-pushed the feature-metadata-detailed-dataset branch from 233f89b to 9a46519 Compare October 7, 2024 11:56

mofosyne marked this pull request as ready for review October 7, 2024 12:05

mofosyne requested a review from Galunid October 7, 2024 12:05

mofosyne requested a review from compilade October 7, 2024 12:05

mofosyne requested a review from slaren October 21, 2024 10:13

ggerganov approved these changes Nov 12, 2024

View reviewed changes

mofosyne merged commit a0ec17b into ggerganov:master Nov 13, 2024
8 checks passed

mofosyne deleted the feature-metadata-detailed-dataset branch November 13, 2024 10:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

metadata: Detailed Dataset Authorship Metadata #8875

metadata: Detailed Dataset Authorship Metadata #8875

mofosyne commented Aug 5, 2024 •

edited

Loading

mofosyne commented Aug 5, 2024 •

edited

Loading

mofosyne commented Aug 7, 2024 •

edited

Loading

mofosyne commented Aug 10, 2024

compilade left a comment

mofosyne commented Aug 28, 2024 •

edited

Loading

mofosyne commented Aug 28, 2024 •

edited

Loading

mofosyne commented Sep 8, 2024 •

edited

Loading

mofosyne commented Oct 7, 2024 •

edited

Loading

mofosyne commented Nov 12, 2024 •

edited

Loading

ggerganov left a comment

mofosyne commented Nov 13, 2024

metadata: Detailed Dataset Authorship Metadata #8875

metadata: Detailed Dataset Authorship Metadata #8875

Conversation

mofosyne commented Aug 5, 2024 • edited Loading

mofosyne commented Aug 5, 2024 • edited Loading

mofosyne commented Aug 7, 2024 • edited Loading

mofosyne commented Aug 10, 2024

compilade left a comment

Choose a reason for hiding this comment

mofosyne commented Aug 28, 2024 • edited Loading

mofosyne commented Aug 28, 2024 • edited Loading

mofosyne commented Sep 8, 2024 • edited Loading

mofosyne commented Oct 7, 2024 • edited Loading

mofosyne commented Nov 12, 2024 • edited Loading

ggerganov left a comment

Choose a reason for hiding this comment

mofosyne commented Nov 13, 2024

mofosyne commented Aug 5, 2024 •

edited

Loading

mofosyne commented Aug 5, 2024 •

edited

Loading

mofosyne commented Aug 7, 2024 •

edited

Loading

mofosyne commented Aug 28, 2024 •

edited

Loading

mofosyne commented Aug 28, 2024 •

edited

Loading

mofosyne commented Sep 8, 2024 •

edited

Loading

mofosyne commented Oct 7, 2024 •

edited

Loading

mofosyne commented Nov 12, 2024 •

edited

Loading