Automate vocab support and model conversion #7379

Draft — wants to merge 102 commits into base: master

Changes from 1 commit

Commits (102)
dbdf6c2
feat: Add prototype for managing huggingface hub content
teleprint-me May 18, 2024
ba13d64
feat: Add utils for logging and writing when interacting with Hugging…
teleprint-me May 18, 2024
742abeb
refactor: Add log for status and fix url path variable name
teleprint-me May 18, 2024
98cf788
patch: Apply minor fixes for handling headers and writing content
teleprint-me May 18, 2024
4790f76
feat: Add prototype for requesting vocab related files
teleprint-me May 18, 2024
5c8144e
feat: Add download_model method and fix references for clarity to mit…
teleprint-me May 18, 2024
1a286c8
refactor: Clean up variable names and separate concerns when download…
teleprint-me May 18, 2024
3ba01c7
chore: Fix spacing
teleprint-me May 18, 2024
f7515ab
feat: Add tokenizer types, model types, and model repos
teleprint-me May 18, 2024
3022587
refactor: Apply model schema to tokenizer downloads
teleprint-me May 18, 2024
b2ca23c
feat: Add method for generating the checksums and writing the results…
teleprint-me May 18, 2024
5eda2c9
feat: Add pre-tokenizer logging
teleprint-me May 18, 2024
2ef73ee
refactor: Apply SoC for HF requests, vocab, and weights
teleprint-me May 18, 2024
04fb788
chore: Apply isort to package gguf init
teleprint-me May 18, 2024
832b449
feat: Add pre-tokenizer CLI tooling
teleprint-me May 18, 2024
b6f70b8
chore: Fix line spacing
teleprint-me May 18, 2024
006bb60
chore: Fix model path references
teleprint-me May 18, 2024
1a82573
feat: Add example script for automating generating tokenizer model ch…
teleprint-me May 19, 2024
4b3735c
chore: Remove cluttered vocab files
teleprint-me May 19, 2024
0479e96
patch: Add exception handling for non-existent vocab related files
teleprint-me May 19, 2024
bd32266
feat: Add function for generating vocab script and fix CLI opts
teleprint-me May 19, 2024
d02a0f4
feat: Add vocab generation script
teleprint-me May 19, 2024
ce777c8
Merge branch 'master' into auto-model-support
teleprint-me May 19, 2024
da5deeb
fix: Apply fix to verbose help description and generating vocab tests…
teleprint-me May 19, 2024
316b404
patch: Fix CLI option for generating vocab tests
teleprint-me May 19, 2024
5840b6f
refactor: Simplify the get_vocab_base_pre method
teleprint-me May 19, 2024
dcc5d42
fix: Remove dangling if statement
teleprint-me May 19, 2024
c6f2a48
feat: Add prototype for identifying the vocab type
teleprint-me May 20, 2024
89a46fe
feat: Attempt to mirror the llama.cpp API for compatibility
teleprint-me May 20, 2024
a0362ea
patch: Fix nested quotes for dict refs
teleprint-me May 20, 2024
9a2834e
fix: Use __name__ as logger name
teleprint-me May 20, 2024
381dad5
fix: Add missing model architectures
teleprint-me May 20, 2024
6fc4492
chore: Add english pangram to vocab tests
teleprint-me May 20, 2024
a1951e2
refactor: Add proper names for remote model references
teleprint-me May 20, 2024
bdd0286
refactor: Use proper names for referenced member variables
teleprint-me May 20, 2024
18bb36e
chore: Allow the user to config the logger
teleprint-me May 20, 2024
d9ba963
refactor: Restructure tokenizer model metadata
teleprint-me May 20, 2024
2fa2c7a
chore: Move enums and model map to constants
teleprint-me May 20, 2024
5978bb0
chore: Fix and update comments
teleprint-me May 20, 2024
12537fd
chore: Add tokenizer constants for model metadata
teleprint-me May 21, 2024
aed0573
proto: Add experimental vocab pre-tokenizer regular expressions
teleprint-me May 21, 2024
a35b767
Merge branch 'master' into auto-model-support
teleprint-me May 21, 2024
6296206
chore: Apply deduped token type references
teleprint-me May 21, 2024
a3bdac0
chore: Remove unused enum import reference
teleprint-me May 21, 2024
fb32f50
feat: Add hf model mapping descriptors for each repo
teleprint-me May 21, 2024
4768650
chore: Add formatting, set common vocab files, apply pattern to model…
teleprint-me May 21, 2024
2fe28ad
chore: Rename from repo to model repo and reorder for improved readab…
teleprint-me May 21, 2024
83b9fcd
refactor: Rename constants to reduce confusion between references
teleprint-me May 21, 2024
b2aac68
docs: Fix comment
teleprint-me May 21, 2024
34e14ae
refactor: Add experimental model mappings
teleprint-me May 21, 2024
0b43e14
refactor: Add experimental mapping for BPE pre-tokenizers
teleprint-me May 22, 2024
12285b5
chore: Map model file and vocab types
teleprint-me May 22, 2024
1957ca4
refactor: Simplify BPE pre-tokenizer mapping
teleprint-me May 22, 2024
cd00be8
chore: Add model metadata
teleprint-me May 22, 2024
78d7828
chore: Add prototyped CLI options
teleprint-me May 22, 2024
9814b7f
feat: Add custom huggingface hub api
teleprint-me May 23, 2024
9ba6b92
chore: Add required vocabulary constants
teleprint-me May 23, 2024
0ccf579
refactor: Apply consistent naming conventions
teleprint-me May 23, 2024
c92c6ad
feat: Add CLI tool for fetching vocab files
teleprint-me May 24, 2024
1749209
refactor: Simplify huggingface hub api implementation
teleprint-me May 24, 2024
f62080a
refactor: Simplify huggingface hub vocab request
teleprint-me May 24, 2024
ea4fc10
refactor: Apply fixes to required arguments and fixes to options
teleprint-me May 24, 2024
b4b553f
chore: Apply ruff formatting for readability
teleprint-me May 24, 2024
77bc739
refactor: Add tokenizer path, add methods for extracting vocab metada…
teleprint-me May 24, 2024
c91dcdf
refactor: Add fixes for logging
teleprint-me May 24, 2024
e62e09b
refactor: Apply fix for file path references
teleprint-me May 24, 2024
6c9ac0f
refactor: Add a custom tokenizer component and fix vocab request class
teleprint-me May 24, 2024
6409694
refactor: Simplify the huggingface hub api to enable flexible model r…
teleprint-me May 24, 2024
6da2bd6
patch: Apply fix for paths and logging
teleprint-me May 25, 2024
168297f
refactor: Add remote repository listings to the base HFHub class
teleprint-me May 25, 2024
99275a1
refactor: Simplify API and merge HFModel into HFHub
teleprint-me May 25, 2024
4438d05
refactor: Abstract file and logger management to streamline api inter…
teleprint-me May 25, 2024
fda2319
refactor: Streamline method signatures and clarify method names relat…
teleprint-me May 25, 2024
2ffe6b8
Refactor HFHubModel and HFHubTokenizer to fix reference issues
teleprint-me May 25, 2024
63c3410
refactor: Add support for model file types
teleprint-me May 25, 2024
6c1b011
refactor: Apply huggingface_hub api to CLI
teleprint-me May 25, 2024
e9759de
docs: Add revisions to hub-vocab.py module level docstring
teleprint-me May 25, 2024
f30bd63
refactor: Add function for building and parsing CLI arguments
teleprint-me May 25, 2024
da72554
feat: Add static methods for resolving model types and model extensions
teleprint-me May 25, 2024
fcd20ab
chore: Add comments for each file extension type
teleprint-me May 25, 2024
e4275bc
feat: Add example script for downloading models
teleprint-me May 25, 2024
b3a5429
Merge branch 'huggingface-hub-api' into auto-model-support
teleprint-me May 26, 2024
36bea17
Merge branch 'master' into auto-model-support
teleprint-me May 26, 2024
7f48eb9
feat: Add experimental model registry for known models and their rela…
teleprint-me May 27, 2024
b1c922f
feat: Add a proto sketch for handling model vocab metadata
teleprint-me May 27, 2024
0732bd9
feat: Ignore pre-existing model files
teleprint-me May 27, 2024
2153949
feat: Add prototype for bootstrapping registry
teleprint-me May 27, 2024
0a478c0
chore: Add pre tokenizers and include enum mappings
teleprint-me May 27, 2024
9dbc957
refactor: Simplify tokenizers implementation
teleprint-me May 28, 2024
aa28cfe
chore: Fix import path, token comparisons, and update token type refe…
teleprint-me May 28, 2024
f1d067e
refactor: Simplify huggingface hub api and update to reflect changes …
teleprint-me May 28, 2024
5c92809
refactor: Apply updates to example script for generating the registry
teleprint-me May 28, 2024
6a725cf
Merge branch 'master' into auto-model-support
teleprint-me May 28, 2024
de0f0d0
Merge branch 'master' into auto-model-support
teleprint-me May 29, 2024
c2e4897
Merge branch 'master' into auto-model-support
teleprint-me May 31, 2024
47ef615
refactor: Add prototyped bridge interface for tokenizers and llama.cpp
teleprint-me Jun 1, 2024
c447010
refactor: Add prototyped bridge interface for transformers and tokeni…
teleprint-me Jun 1, 2024
647d252
patch: Apply fix for backward compat for source repo
teleprint-me Jun 1, 2024
250bddf
Merge branch 'master' into auto-model-support
teleprint-me Jun 2, 2024
e2b7608
chore: Add ignore rule for generated server themes
teleprint-me Jun 2, 2024
ce8524a
Merge branch 'ignore-gen-themes' into auto-model-support
teleprint-me Jun 2, 2024
5836d6c
refactor: Clean up constants and simplify the custom hf hub api
teleprint-me Jun 2, 2024
chore: Add model metadata
teleprint-me committed May 22, 2024
commit cd00be886f417cace4e74f33371af5bcff375bfe
153 changes: 103 additions & 50 deletions gguf-py/gguf/constants.py
@@ -992,10 +992,14 @@ class HFModelFileType(IntEnum):
)

# NOTE: GPT-2 is the standard default pre-tokenizer for all models
# NOTE: BERT models inherit from the Byte Level Pre-tokenizer.
# https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/pre_tokenizers/byte_level.rs#L117
# https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/pre_tokenizers/bert.rs#L13
BPE_PRE_TOKENIZERS = {
# gpt2, olmo, phi (1, 1_5, 2, 3, ...)
"gpt2": (GPT_PRE_TOKENIZER_DEFAULT,),
# dbrx
# NOTE: PR#6920: https://github.com/ggerganov/llama.cpp/pull/6920#issuecomment-2080233989
"llama3": (
"(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
),
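The `BPE_PRE_TOKENIZERS` patterns above are split regexes applied to raw text before BPE merges. As a rough illustration (not taken from the PR), here is a simplified ASCII-only sketch of the GPT-2-style split rule; the real patterns use Unicode property classes like `\p{L}`/`\p{N}`, which require the third-party `regex` package rather than the stdlib `re` shown here.

```python
import re

# Simplified GPT-2-style pre-tokenizer split (assumption: [A-Za-z] and [0-9]
# stand in for the Unicode classes \p{L} and \p{N} used by the real patterns).
GPT2_SPLIT = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)

def pre_tokenize(text: str) -> list[str]:
    """Split text into pre-tokens; BPE merges would run on each piece."""
    return GPT2_SPLIT.findall(text)

print(pre_tokenize("Hello world's 123!"))
# → ['Hello', " world", "'s", ' 123', '!']
```

Note how contractions (`'s`) and leading spaces are kept as part of the pieces, which is why the GPT-2 family of patterns is sensitive to exact regex wording.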
@@ -1033,7 +1037,7 @@ class HFModelFileType(IntEnum):
# This will get out of control if not properly managed.
# This needs a proper solution. The short-term solution is to manually build a map here.
# A proper long-term solution would be to build a dynamic registry.
-# The issue is that this requires a mapping or a database.
+# The issue is that this requires a dynamically persistent mapping or a database.
# Possible solutions are to use JSON, HDF5, or SQLite.
# Some of these mappings could be dynamically generated, but it's sketchy at best.
# Model versions should be included along with the model name to mitigate name conflicts.
@@ -1060,14 +1064,14 @@ class HFModelFileType(IntEnum):
# - Possible algorithms are WordLevel, BPE, WordPiece, or Unigram
# - Possible LLaMa tokenizer model types are: None, SPM, BPE, or WPM
HF_MODEL_MAP = (
Comment from @teleprint-me (Contributor, Author), May 23, 2024:
@mofosyne I'm thinking about adding a registry. This way we can register the necessary metadata for each model. The easiest way I can think about doing it for now is to write it to a JSON file. What do you think?
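The JSON-registry idea floated in this comment can be sketched as a simple round trip. This is hypothetical illustration only (the file name `registry.json` and the helper names are not part of the PR); keys mirror the `HF_MODEL_MAP` entries in the diff, with enum members written as plain strings since `IntEnum` values would not survive JSON as-is.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical registry entry mirroring the HF_MODEL_MAP schema in the diff.
registry = [
    {
        "model_repo": "meta-llama/Llama-2-7b-hf",
        "model_arch": "llama",
        "model_parts": 2,
        "model_type": "safetensors",
        "vocab_type": "SPM",
        "vocab_files": ["tokenizer.model", "tokenizer_config.json"],
    },
]

def save_registry(entries: list, path: Path) -> None:
    # Persist the metadata so it can be regenerated or edited out-of-band.
    path.write_text(json.dumps(entries, indent=2), encoding="utf-8")

def load_registry(path: Path) -> list:
    return json.loads(path.read_text(encoding="utf-8"))

with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp) / "registry.json"
    save_registry(registry, p)
    assert load_registry(p) == registry
```

SQLite or HDF5 (the other options named in the comment block above) would add querying or binary storage, at the cost of JSON's easy hand-editing.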

-# Sentence Piece Models
+# SPM (Sentence Piece Models): Default to Byte Level Pre-tokenization.
{
"model_repo": "meta-llama/Llama-2-7b-hf",
"model_arch": MODEL_ARCH_NAMES[MODEL_ARCH.LLAMA],
"model_parts": 2,
"model_type": HFModelFileType.SFT,
"vocab_type": LLaMaVocabType.SPM,
-"vocab_pre": (),
+"vocab_pre": GPT_PRE_TOKENIZER_DEFAULT,
"vocab_files": HF_TOKENIZER_SPM_FILES,
},
{
@@ -1076,7 +1080,7 @@ class HFModelFileType(IntEnum):
"model_parts": 3,
"model_type": HFModelFileType.SFT,
"vocab_type": LLaMaVocabType.SPM,
-"vocab_pre": (),
+"vocab_pre": GPT_PRE_TOKENIZER_DEFAULT,
"vocab_files": HF_TOKENIZER_SPM_FILES,
},
{
@@ -1085,7 +1089,7 @@ class HFModelFileType(IntEnum):
"model_parts": 8,
"model_type": HFModelFileType.SFT,
"vocab_type": LLaMaVocabType.SPM,
-"vocab_pre": (),
+"vocab_pre": GPT_PRE_TOKENIZER_DEFAULT,
"vocab_files": HF_TOKENIZER_SPM_FILES,
},
{
@@ -1094,35 +1098,37 @@ class HFModelFileType(IntEnum):
"model_parts": 2,
"model_type": HFModelFileType.SFT,
"vocab_type": LLaMaVocabType.SPM,
-"vocab_pre": (),
+"vocab_pre": GPT_PRE_TOKENIZER_DEFAULT,
"vocab_files": HF_TOKENIZER_SPM_FILES,
},
-# Word Piece Models
+# WPM (Word Piece Models): Default to Byte Level Pre-tokenization.
+# NOTE: BERT Normalization and Pre-tokenization rules differ from Byte Level Pre-tokenization.
{
"model_repo": "BAAI/bge-small-en-v1.5",
"model_arch": MODEL_ARCH_NAMES[MODEL_ARCH.BERT],
"model_parts": 1,
"model_type": HFModelFileType.BIN,
"vocab_type": LLaMaVocabType.WPM,
"vocab_pre": (),
"vocab_pre": GPT_PRE_TOKENIZER_DEFAULT,
"vocab_files": HF_TOKENIZER_BPE_FILES,
},
{
"model_repo": "jinaai/jina-embeddings-v2-base-en",
-"model_arch": MODEL_ARCH.JINA_BERT_V2,
+"model_arch": MODEL_ARCH_NAMES[MODEL_ARCH.JINA_BERT_V2],
"model_parts": 1,
"model_type": HFModelFileType.SFT,
"vocab_type": LLaMaVocabType.WPM,
"vocab_pre": GPT_PRE_TOKENIZER_DEFAULT,
"vocab_files": HF_TOKENIZER_BPE_FILES,
},
-# Byte Pair Encoding Models
+# BPE (Byte Pair Encoding Models): Default is Byte Level Pre-tokenization
{
"model_repo": "meta-llama/Meta-Llama-3-8B",
"model_arch": MODEL_ARCH.LLAMA,
"model_parts": 4,
"model_type": HFModelFileType.SFT,
"vocab_type": LLaMaVocabType.BPE,
-# PR#6920: https://github.com/ggerganov/llama.cpp/pull/6920#issuecomment-2080233989
-"vocab_pre": (
-"(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
-),
+"vocab_pre": BPE_PRE_TOKENIZERS["llama3"],
"vocab_files": HF_TOKENIZER_BPE_FILES,
},
{
@@ -1131,7 +1137,7 @@ class HFModelFileType(IntEnum):
"model_parts": 2,
"model_type": HFModelFileType.BIN,
"vocab_type": LLaMaVocabType.BPE,
-"vocab_pre": BPE_PRE_TOKENIZERS[MODEL_ARCH_NAMES[MODEL_ARCH.FALCON]],
+"vocab_pre": BPE_PRE_TOKENIZERS["falcon"],
"vocab_files": HF_TOKENIZER_BPE_FILES,
},
{
@@ -1140,14 +1146,7 @@ class HFModelFileType(IntEnum):
"model_parts": 2,
"model_type": HFModelFileType.BIN,
"vocab_type": LLaMaVocabType.BPE,
-"vocab_pre": (
-"[\r\n]",
-"\\s?[A-Za-zµÀ-ÖØ-öø-ƺƼ-ƿDŽ-ʓʕ-ʯͰ-ͳͶͷͻ-ͽͿΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԯԱ-ՖႠ-ჅᎠ-Ᏽᏸ-ᏽᲐ-ᲺᲽ-Ჿᴀ-ᴫᵫ-ᵷᵹ-ᶚḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼℂℇℊ-ℓℕℙ-ℝℤΩℨK-ℭℯ-ℴℹℼ-ℿⅅ-ⅉⅎↃↄⰀ-ⱻⱾ-ⳤⳫ-ⳮⳲⳳꙀ-ꙭꚀ-ꚛꜢ-ꝯꝱ-ꞇꞋ-ꞎꭰ-ꮿff-stﬓ-ﬗA-Za-z𐐀-𐑏𐒰-𐓓𐓘-𐓻𐲀-𐲲𐳀-𐳲𑢠-𑣟𞤀-𞥃]+",
-"\\s?[!-/:-~！-／：-～‘-‟　-。]+",
-"\\s+$",
-"[一-龥ࠀ-一가-퟿]+",
-"\\p{N}+",
-),
+"vocab_pre": BPE_PRE_TOKENIZERS["deepseek"],
"vocab_files": HF_TOKENIZER_BPE_FILES,
},
{
@@ -1156,13 +1155,7 @@ class HFModelFileType(IntEnum):
"model_parts": 2,
"model_type": HFModelFileType.SFT,
"vocab_type": LLaMaVocabType.BPE,
-"vocab_pre": (
-"[\r\n]",
-"\\s?\\p{L}+",
-"\\s?\\p{P}+",
-"[一-龥ࠀ-一가-퟿]+",
-"\\p{N}",
-),
+"vocab_pre": BPE_PRE_TOKENIZERS["deepseek-coder"],
"vocab_files": HF_TOKENIZER_BPE_FILES,
},
{
@@ -1171,74 +1164,134 @@ class HFModelFileType(IntEnum):
"model_parts": 2,
"model_type": HFModelFileType.BIN,
"vocab_type": LLaMaVocabType.BPE,
-"vocab_pre": (
-"\\s?\\p{L}+",
-"\\s?\\p{P}+",
-"'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
-),
+"vocab_pre": BPE_PRE_TOKENIZERS["mpt"],
"vocab_files": HF_TOKENIZER_BPE_FILES,
},
#
# BPE: STARCODER
#
{
"model_repo": "bigcode/starcoder2-3b",
"model_arch": MODEL_ARCH.STARCODER2,
"model_parts": 1,
"model_type": HFModelFileType.SFT,
"vocab_type": LLaMaVocabType.BPE,
-"vocab_pre": (
-"\\p{N}",
-"'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
-),
+"vocab_pre": BPE_PRE_TOKENIZERS["starcoder"],
"vocab_files": HF_TOKENIZER_BPE_FILES,
},
-{
-"model_repo": "openai-community/gpt2",
-"model_arch": MODEL_ARCH.GPT2,
-"vocab_type": LLaMaVocabType.BPE,
-},
{
"model_repo": "smallcloudai/Refact-1_6-base",
"model_arch": MODEL_ARCH.REFACT,
"model_parts": 1,
"model_type": HFModelFileType.BIN,
"vocab_type": LLaMaVocabType.BPE,
"vocab_pre": BPE_PRE_TOKENIZERS["starcoder"],
"vocab_files": HF_TOKENIZER_BPE_FILES,
},
{
"model_repo": "CohereForAI/c4ai-command-r-v01",
"model_arch": MODEL_ARCH.COMMAND_R,
"model_parts": 15,
"model_type": HFModelFileType.SFT,
"vocab_type": LLaMaVocabType.BPE,
"vocab_pre": BPE_PRE_TOKENIZERS["starcoder"],
"vocab_files": HF_TOKENIZER_BPE_FILES,
},
#
# BPE: QWEN
#
{
"model_repo": "Qwen/Qwen1.5-7B",
"model_arch": MODEL_ARCH.QWEN2,
"model_parts": 4,
"model_type": HFModelFileType.SFT,
"vocab_type": LLaMaVocabType.BPE,
"vocab_pre": BPE_PRE_TOKENIZERS["qwen"],
"vocab_files": HF_TOKENIZER_BPE_FILES,
},
{
"model_repo": "stabilityai/stablelm-2-zephyr-1_6b",
"model_arch": MODEL_ARCH.STABLELM,
"model_parts": 1,
"model_type": HFModelFileType.SFT,
"vocab_type": LLaMaVocabType.BPE,
"vocab_pre": BPE_PRE_TOKENIZERS["qwen"],
"vocab_files": HF_TOKENIZER_BPE_FILES,
},
#
# BPE: GPT-2
#
{
"model_repo": "openai-community/gpt2",
"model_arch": MODEL_ARCH.GPT2,
"model_parts": 1,
"model_type": HFModelFileType.SFT,
"vocab_type": LLaMaVocabType.BPE,
"vocab_pre": BPE_PRE_TOKENIZERS["gpt2"],
"vocab_files": HF_TOKENIZER_BPE_FILES,
},
{
"model_repo": "allenai/OLMo-1.7-7B-hf",
"model_arch": MODEL_ARCH.OLMO,
"model_parts": 6,
"model_type": HFModelFileType.SFT,
"vocab_type": LLaMaVocabType.BPE,
"vocab_pre": BPE_PRE_TOKENIZERS["gpt2"],
"vocab_files": HF_TOKENIZER_BPE_FILES,
},
-{
+{ # NOTE: I don't have access to this model
"model_repo": "databricks/dbrx-base",
"model_arch": MODEL_ARCH.DBRX,
"model_parts": 0,
"model_type": HFModelFileType.SFT,
"vocab_type": LLaMaVocabType.BPE,
"vocab_pre": BPE_PRE_TOKENIZERS["gpt2"],
"vocab_files": HF_TOKENIZER_BPE_FILES,
},
-{
+{ # NOTE: RoBERTa post processor
"model_repo": "jinaai/jina-embeddings-v2-base-es",
"model_arch": MODEL_ARCH.JINA_BERT_V2,
"model_parts": 1,
"model_type": HFModelFileType.SFT,
"vocab_type": LLaMaVocabType.BPE,
"vocab_pre": BPE_PRE_TOKENIZERS["gpt2"],
"vocab_files": HF_TOKENIZER_BPE_FILES,
},
-{
+{ # NOTE: RoBERTa post processor
"model_repo": "jinaai/jina-embeddings-v2-base-de",
"model_arch": MODEL_ARCH.JINA_BERT_V2,
"model_parts": 1,
"model_type": HFModelFileType.SFT,
"vocab_type": LLaMaVocabType.BPE,
"vocab_pre": BPE_PRE_TOKENIZERS["gpt2"],
"vocab_files": HF_TOKENIZER_BPE_FILES,
},
-{
+{ # NOTE: Phi-1 is compatible with GPT-2 arch and vocab
"model_repo": "microsoft/phi-1",
"model_arch": MODEL_ARCH.PHI2,
"model_parts": 1,
"model_type": HFModelFileType.SFT,
"vocab_type": LLaMaVocabType.BPE,
"vocab_pre": BPE_PRE_TOKENIZERS["gpt2"],
"vocab_files": HF_TOKENIZER_BPE_FILES,
},
{
-"model_repo": "stabilityai/stablelm-2-zephyr-1_6b",
-"model_arch": MODEL_ARCH.STABLELM,
+"model_repo": "microsoft/phi-1_5",
+"model_arch": MODEL_ARCH.PHI2,
"model_parts": 1,
"model_type": HFModelFileType.SFT,
"vocab_type": LLaMaVocabType.BPE,
"vocab_pre": BPE_PRE_TOKENIZERS["gpt2"],
"vocab_files": HF_TOKENIZER_BPE_FILES,
},
{
"model_repo": "microsoft/phi-2",
"model_arch": MODEL_ARCH.PHI2,
"model_parts": 2,
"model_type": HFModelFileType.SFT,
"vocab_type": LLaMaVocabType.BPE,
"vocab_pre": BPE_PRE_TOKENIZERS["gpt2"],
"vocab_files": HF_TOKENIZER_BPE_FILES,
},
)
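Since `HF_MODEL_MAP` is a flat tuple of dicts, resolving a model's metadata implies a linear scan keyed on `model_repo`. A minimal sketch of such a lookup, using stub entries rather than the PR's real constants:

```python
# Stub stand-in for HF_MODEL_MAP; only the keys needed for the demo are kept.
HF_MODEL_MAP = (
    {"model_repo": "openai-community/gpt2", "vocab_type": "BPE"},
    {"model_repo": "meta-llama/Llama-2-7b-hf", "vocab_type": "SPM"},
)

def get_model_entry(model_repo: str) -> dict:
    """Return the metadata entry for a repo, or raise KeyError if unmapped."""
    for entry in HF_MODEL_MAP:
        if entry["model_repo"] == model_repo:
            return entry
    raise KeyError(f"unknown model repo: {model_repo}")

print(get_model_entry("openai-community/gpt2")["vocab_type"])  # → BPE
```

A dict keyed by `model_repo` would make this O(1), which is one more argument for the registry approach discussed in the review comment.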