Automate vocab support and model conversion #7379
Draft: teleprint-me wants to merge 102 commits into ggerganov:master from teleprint-me:auto-model-support
+1,437
−2,196
dbdf6c2 feat: Add prototype for managing huggingface hub content (teleprint-me)
ba13d64 feat: Add utils for logging and writing when interacting with Hugging… (teleprint-me)
742abeb refactor: Add log for status and fix url path variable name (teleprint-me)
98cf788 patch: Apply minor fixes for handling headers and writing content (teleprint-me)
4790f76 feat: Add prototype for requesting vocab related files (teleprint-me)
5c8144e feat: Add download_model method and fix references for clarity to mit… (teleprint-me)
1a286c8 refactor: Clean up variable names and separate concerns when download… (teleprint-me)
3ba01c7 chore: Fix spacing (teleprint-me)
f7515ab feat: Add tokenizer types, model types, and model repos (teleprint-me)
3022587 refactor: Apply model schema to tokenizer downloads (teleprint-me)
b2ca23c feat: Add method for generating the checksums and writing the results… (teleprint-me)
5eda2c9 feat: Add pre-tokenizer logging (teleprint-me)
2ef73ee refactor: Apply SoC for HF requests, vocab, and weights (teleprint-me)
04fb788 chore: Apply isort to package gguf init (teleprint-me)
832b449 feat: Add pre-tokenizer CLI tooling (teleprint-me)
b6f70b8 chore: Fix line spacing (teleprint-me)
006bb60 chore: Fix model path references (teleprint-me)
1a82573 feat: Add example script for automating generating tokenizer model ch… (teleprint-me)
4b3735c chore: Remove cluttered vocab files (teleprint-me)
0479e96 patch: Add exception handling for non-existent vocab related files (teleprint-me)
bd32266 feat: Add function for generating vocab script and fix CLI opts (teleprint-me)
d02a0f4 feat: Add vocab generation script (teleprint-me)
ce777c8 Merge branch 'master' into auto-model-support (teleprint-me)
da5deeb fix: Apply fix to verbose help description and generating vocab tests… (teleprint-me)
316b404 patch: Fix CLI option for generating vocab tests (teleprint-me)
5840b6f refactor: Simplify the get_vocab_base_pre method (teleprint-me)
dcc5d42 fix: Remove dangling if statement (teleprint-me)
c6f2a48 feat: Add prototype for identifying the vocab type (teleprint-me)
89a46fe feat: Attempt to mirror the llama.cpp API for compatibility (teleprint-me)
a0362ea patch: Fix nested quotes for dict refs (teleprint-me)
9a2834e fix: Use __name__ as logger name (teleprint-me)
381dad5 fix: Add missing model architectures (teleprint-me)
6fc4492 chore: Add english pangram to vocab tests (teleprint-me)
a1951e2 refactor: Add proper names for remote model references (teleprint-me)
bdd0286 refactor: Use proper names for referenced member variables (teleprint-me)
18bb36e chore: Allow the user to config the logger (teleprint-me)
d9ba963 refactor: Restructure tokenizer model metadata (teleprint-me)
2fa2c7a chore: Move enums and model map to constants (teleprint-me)
5978bb0 chore: Fix and update comments (teleprint-me)
12537fd chore: Add tokenizer constants for model metadata (teleprint-me)
aed0573 proto: Add experimental vocab pre-tokenizer regular expressions (teleprint-me)
a35b767 Merge branch 'master' into auto-model-support (teleprint-me)
6296206 chore: Apply deduped token type references (teleprint-me)
a3bdac0 chore: Remove unused enum import reference (teleprint-me)
fb32f50 feat: Add hf model mapping descriptors for each repo (teleprint-me)
4768650 chore: Add formatting, set common vocab files, apply pattern to model… (teleprint-me)
2fe28ad chore: Rename from repo to model repo and reorder for improved readab… (teleprint-me)
83b9fcd refactor: Rename constants to reduce confusion between references (teleprint-me)
b2aac68 docs: Fix comment (teleprint-me)
34e14ae refactor: Add experimental model mappings (teleprint-me)
0b43e14 refactor: Add experimental mapping for BPE pre-tokenizers (teleprint-me)
12285b5 chore: Map model file and vocab types (teleprint-me)
1957ca4 refactor: Simplify BPE pre-tokenizer mapping (teleprint-me)
cd00be8 chore: Add model metadata (teleprint-me)
78d7828 chore: Add prototyped CLI options (teleprint-me)
9814b7f feat: Add custom huggingface hub api (teleprint-me)
9ba6b92 chore: Add required vocabulary constants (teleprint-me)
0ccf579 refactor: Apply consistent naming conventions (teleprint-me)
c92c6ad feat: Add CLI tool for fetching vocab files (teleprint-me)
1749209 refactor: Simplify huggingface hub api implementation (teleprint-me)
f62080a refactor: Simplify huggingface hub vocab request (teleprint-me)
ea4fc10 refactor: Apply fixes to required arguments and fixes to options (teleprint-me)
b4b553f chore: Apply ruff formatting for readability (teleprint-me)
77bc739 refactor: Add tokenizer path, add methods for extracting vocab metada… (teleprint-me)
c91dcdf refactor: Add fixes for logging (teleprint-me)
e62e09b refactor: Apply fix for file path references (teleprint-me)
6c9ac0f refactor: Add a custom tokenizer component and fix vocab request class (teleprint-me)
6409694 refactor: Simplify the huggingface hub api to enable flexible model r… (teleprint-me)
6da2bd6 patch: Apply fix for paths and logging (teleprint-me)
168297f refactor: Add remote repository listings to the bas HFHub class (teleprint-me)
99275a1 refactor: Simplify API and merge HFModel into HFHub (teleprint-me)
4438d05 refactor: Abstract file and logger management to streamline api inter… (teleprint-me)
fda2319 refactor: Streamline method signatures and clarify method names relat… (teleprint-me)
2ffe6b8 Refactor HFubModel and HFHubTokenizer to fix reference issues (teleprint-me)
63c3410 refactor: Add support for model file types (teleprint-me)
6c1b011 refactor: Apply huggingface_hub api to CLI (teleprint-me)
e9759de docs: Add revisions to hub-vocab.py module level docstring (teleprint-me)
f30bd63 refactor: Add function for building and parsing CLI arguments (teleprint-me)
da72554 feat: Add static methods for resolving model types and model extensions (teleprint-me)
fcd20ab chore: Add comments for each file extension type (teleprint-me)
e4275bc feat: Add example script for downloading models (teleprint-me)
b3a5429 Merge branch 'huggingface-hub-api' into auto-model-support (teleprint-me)
36bea17 Merge branch 'master' into auto-model-support (teleprint-me)
7f48eb9 feat: Add experimental model registry for known models and their rela… (teleprint-me)
b1c922f feat: Add a proto sketch for handling mode vocab metadata (teleprint-me)
0732bd9 feat: Ignore pre-existing model files (teleprint-me)
2153949 feat: Add prototype for bootstrapping registry (teleprint-me)
0a478c0 chore: Add pre tokenizers and include enum mappings (teleprint-me)
9dbc957 refactor: Simplify tokenizers implementation (teleprint-me)
aa28cfe chore: Fix import path, token comparisons, and update token type refe… (teleprint-me)
f1d067e refactor: Simplify huggingface hub api and update to reflect changes … (teleprint-me)
5c92809 refactor: Apply updates to example script for generating the registry (teleprint-me)
6a725cf Merge branch 'master' into auto-model-support (teleprint-me)
de0f0d0 Merge branch 'master' into auto-model-support (teleprint-me)
c2e4897 Merge branch 'master' into auto-model-support (teleprint-me)
47ef615 refactor: Add prototyped bridge interface for tokenizers and llama.cpp (teleprint-me)
c447010 refactor: Add prototyped bridge interface for transformers and tokeni… (teleprint-me)
647d252 patch: Apply fix for backward compat for source repo (teleprint-me)
250bddf Merge branch 'master' into auto-model-support (teleprint-me)
e2b7608 chore: Add ignore rule for generated server themes (teleprint-me)
ce8524a Merge branch 'ignore-gen-themes' into auto-model-support (teleprint-me)
5836d6c refactor: Clean up constants and simplify the custom hf hub api (teleprint-me)
chore: Add model metadata
commit cd00be886f417cace4e74f33371af5bcff375bfe
@mofosyne I'm thinking about adding a registry so that we can register the necessary metadata for each model. The easiest way I can think of for now is to write it to a JSON file. What do you think?
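
To make the idea concrete, here is a minimal sketch of what a JSON-backed registry could look like. It is not the code in this PR: the function names (`load_registry`, `register_model`, `save_registry`), the repository name, and the metadata fields (`model_arch`, `vocab_type`, `vocab_files`) are all hypothetical and only illustrate the shape of the approach.

```python
import json
import tempfile
from pathlib import Path


def load_registry(path: Path) -> dict:
    """Return the registry stored at `path`, or an empty dict if the file is missing."""
    if not path.exists():
        return {}
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


def register_model(registry: dict, model_repo: str, **metadata) -> dict:
    """Add or update the metadata entry for one model repository (hypothetical schema)."""
    entry = registry.setdefault(model_repo, {})
    entry.update(metadata)
    return registry


def save_registry(registry: dict, path: Path) -> None:
    """Write the registry as sorted, indented JSON so that diffs stay readable."""
    with path.open("w", encoding="utf-8") as f:
        json.dump(registry, f, indent=2, sort_keys=True)
        f.write("\n")


if __name__ == "__main__":
    # Use a throwaway directory so the example leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        registry_path = Path(tmp) / "registry.json"
        registry = load_registry(registry_path)
        # The repo name and every field below are placeholders, not real entries.
        register_model(
            registry,
            "example-org/example-model",
            model_arch="llama",
            vocab_type="BPE",
            vocab_files=["tokenizer.json", "tokenizer_config.json"],
        )
        save_registry(registry, registry_path)
        print(registry_path.read_text(encoding="utf-8"))
```

Keying the file by model repository keeps lookups simple, and writing sorted keys means regenerating the registry only produces a diff when an entry actually changes.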