convert.py: add python logging instead of print() #6511

Merged May 3, 2024 (36 commits)
Commits:

573dcec  convert.py: add python logging instead of print() (mofosyne, Apr 6, 2024)
88c1e2f  convert.py: verbose flag takes priority over dump flag log suppression (mofosyne, Apr 6, 2024)
e8be0c8  convert.py: named instance logging (mofosyne, Apr 8, 2024)
8008082  convert.py: use explicit logger id string (mofosyne, Apr 8, 2024)
e6b9d91  convert.py: convert extra print() to named logger (mofosyne, Apr 8, 2024)
3670e16  convert.py: sys.stderr.write --> logger.error (mofosyne, Apr 9, 2024)
f00454f  *.py: Convert all python scripts to use logging module (mofosyne, Apr 10, 2024)
9ad587a  requirements.txt: remove extra line (mofosyne, Apr 10, 2024)
c220e35  flake8: update flake8 ignore and exclude to match ci settings (mofosyne, Apr 10, 2024)
8d855b1  gh-actions: add flake8-no-print to flake8 lint step (mofosyne, Apr 10, 2024)
dd8b977  pre-commit: add flake8-no-print to flake8 and also update pre-commit … (mofosyne, Apr 10, 2024)
44b058d  convert-hf-to-gguf.py: print() to logger conversion (mofosyne, Apr 15, 2024)
1cc38d8  *.py: logging basiconfig refactor to use conditional expression (mofosyne, Apr 17, 2024)
c2e5abd  *.py: removed commented out logging (mofosyne, Apr 18, 2024)
dc2bff4  fixup! *.py: logging basiconfig refactor to use conditional expression (mofosyne, Apr 18, 2024)
cf38b4b  constant.py: logger.error then exit should be a raise exception instead (mofosyne, Apr 18, 2024)
dc798d2  *.py: Convert logger error and sys.exit() into a raise exception (for… (mofosyne, Apr 18, 2024)
ea44905  gguf-convert-endian.py: refactor convert_byteorder() to use tqdm prog… (mofosyne, Apr 18, 2024)
e0372a1  verify-checksum-model.py: This is the result of the program, it shoul… (mofosyne, Apr 18, 2024)
510dea0  compare-llama-bench.py: add blank line for readability during missing… (mofosyne, Apr 18, 2024)
62da83a  reader.py: read_gguf_file() use print() over logging (mofosyne, Apr 18, 2024)
1b1c2ed  convert.py: warning goes to stderr and won't hurt the dump output (mofosyne, Apr 18, 2024)
3a55ae4  gguf-dump.py: dump_metadata() should print to stdout (mofosyne, Apr 18, 2024)
aefd749  convert-hf-to-gguf.py: print --> logger.debug or ValueError() (mofosyne, Apr 18, 2024)
1b7c800  verify-checksum-models.py: use print() for printing table (mofosyne, Apr 18, 2024)
b0b51e7  *.py: refactor logging.basicConfig() (mofosyne, Apr 18, 2024)
fe1d7f6  gguf-py/gguf/*.py: use __name__ as logger name (mofosyne, Apr 18, 2024)
ad53853  python-lint.yml: use .flake8 file instead (mofosyne, Apr 18, 2024)
58d5a5d  constants.py: logger no longer required (mofosyne, Apr 21, 2024)
2d2bc99  convert-hf-to-gguf.py: add additional logging (mofosyne, Apr 21, 2024)
5e5e74e  convert-hf-to-gguf.py: print() --> logger (mofosyne, Apr 24, 2024)
fcc5a5e  *.py: fix flake8 warnings (mofosyne, Apr 29, 2024)
6d42f3d  revert changes to convert-hf-to-gguf.py for get_name() (mofosyne, May 1, 2024)
154ad12  convert-hf-to-gguf-update.py: use triple quoted f-string instead (mofosyne, May 1, 2024)
08e2b77  *.py: accidentally corrected the wrong line (mofosyne, May 2, 2024)
52d0567  *.py: add compilade warning suggestions and style fixes (mofosyne, May 3, 2024)
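
Note: several of the commits above converge on one shared pattern: each script gets a named module-level logger, and verbosity is configured once at startup via a conditional expression (see "*.py: logging basiconfig refactor to use conditional expression"). Below is a minimal sketch of that pattern; the --verbose flag handling and the main() scaffolding are illustrative assumptions, not code taken verbatim from this PR.

import argparse
import logging

# named instance logging, as in the "convert.py: named instance logging" commit
logger = logging.getLogger("convert")


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--verbose", action="store_true", help="increase output verbosity")
    args = parser.parse_args()

    # one-shot configuration via a conditional expression:
    # DEBUG when --verbose is given, INFO otherwise
    logging.basicConfig(level=logging.DEBUG if args.verbose else logging.INFO)

    logger.info("status messages go through the logging module")  # was: print(...)
    logger.debug("shown only with --verbose")


if __name__ == "__main__":
    main()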

.flake8 (3 changes: 2 additions & 1 deletion)
@@ -1,3 +1,4 @@
 [flake8]
 max-line-length = 125
-ignore = W503
+ignore = E203,E211,E221,E225,E231,E241,E251,E261,E266,E501,E701,E704,W503
+exclude = examples/*,examples/*/**,*/**/__init__.py

.github/workflows/python-lint.yml (3 changes: 1 addition & 2 deletions)
@@ -20,5 +20,4 @@ jobs:
       - name: flake8 Lint
         uses: py-actions/flake8@v2
         with:
-          ignore: "E203,E211,E221,E225,E231,E241,E251,E261,E266,E501,E701,E704,W503"
-          exclude: "examples/*,examples/*/**,*/**/__init__.py,convert-hf-to-gguf-update.py"
+          plugins: "flake8-no-print"

.pre-commit-config.yaml (5 changes: 3 additions & 2 deletions)
@@ -3,13 +3,14 @@
 exclude: prompts/.*.txt
 repos:
 -   repo: https://github.com/pre-commit/pre-commit-hooks
-    rev: v3.2.0
+    rev: v4.6.0
     hooks:
     -   id: trailing-whitespace
     -   id: end-of-file-fixer
     -   id: check-yaml
     -   id: check-added-large-files
 -   repo: https://github.com/PyCQA/flake8
-    rev: 6.0.0
+    rev: 7.0.0
     hooks:
     -   id: flake8
+        additional_dependencies: [flake8-no-print]
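
For context, the flake8-no-print plugin registered above fails the lint with error code NP100 whenever a bare print() call is found; this is what enforces the print-to-logging migration in CI and in pre-commit (the # noqa: NP100 marker appears later in this diff for one intentional print). A hypothetical illustration of what it rejects and accepts, not taken from the PR:

import logging

logger = logging.getLogger(__name__)

print("converting...")        # rejected: flake8-no-print reports NP100
print("| name | hash |")      # noqa: NP100  (allowed: explicitly exempted program output)
logger.info("converting...")  # accepted: goes through the logging module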

convert-hf-to-gguf-update.py (157 changes: 82 additions & 75 deletions)
@@ -21,56 +21,64 @@
 # TODO: automate the update of convert-hf-to-gguf.py
 #

+import logging
 import os
 import requests
 import sys
 import json

 from hashlib import sha256
 from enum import IntEnum, auto
+from transformers import AutoTokenizer
+
+logger = logging.getLogger("convert-hf-to-gguf-update")


 class TOKENIZER_TYPE(IntEnum):
     SPM = auto()
     BPE = auto()
     WPM = auto()


 # TODO: this string has to exercise as much pre-tokenizer functionality as possible
 # will be updated with time - contributions welcome
 chktxt = '\n \n\n \n\n\n \t \t\t \t\n \n \n \n \n🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天~ ------======= нещо на Български \'\'\'\'\'\'```````\"\"\"\"......!!!!!!?????? I\'ve been \'told he\'s there, \'RE you sure? \'M not sure I\'ll make it, \'D you like some tea? We\'Ve a\'lL'

 if len(sys.argv) == 2:
     token = sys.argv[1]
 else:
-    print("Usage: python convert-hf-to-gguf-update.py <huggingface_token>")
+    logger.info("Usage: python convert-hf-to-gguf-update.py <huggingface_token>")
     sys.exit(1)

 # TODO: add models here, base models preferred
 models = [
-    { "name": "llama-spm", "tokt": TOKENIZER_TYPE.SPM, "repo": "https://huggingface.co/meta-llama/Llama-2-7b-hf", },
-    { "name": "llama-bpe", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/meta-llama/Meta-Llama-3-8B", },
-    { "name": "phi-3", "tokt": TOKENIZER_TYPE.SPM, "repo": "https://huggingface.co/microsoft/Phi-3-mini-4k-instruct", },
-    { "name": "deepseek-llm", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/deepseek-ai/deepseek-llm-7b-base", },
-    { "name": "deepseek-coder", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base", },
-    { "name": "falcon", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/tiiuae/falcon-7b", },
-    { "name": "bert-bge", "tokt": TOKENIZER_TYPE.WPM, "repo": "https://huggingface.co/BAAI/bge-small-en-v1.5", },
-    { "name": "mpt", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/mosaicml/mpt-7b", },
-    { "name": "starcoder", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/bigcode/starcoder2-3b", },
-    { "name": "gpt-2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/openai-community/gpt2", },
-]
+    {"name": "llama-spm", "tokt": TOKENIZER_TYPE.SPM, "repo": "https://huggingface.co/meta-llama/Llama-2-7b-hf", },
+    {"name": "llama-bpe", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/meta-llama/Meta-Llama-3-8B", },
+    {"name": "phi-3", "tokt": TOKENIZER_TYPE.SPM, "repo": "https://huggingface.co/microsoft/Phi-3-mini-4k-instruct", },
+    {"name": "deepseek-llm", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/deepseek-ai/deepseek-llm-7b-base", },
+    {"name": "deepseek-coder", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base", },
+    {"name": "falcon", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/tiiuae/falcon-7b", },
+    {"name": "bert-bge", "tokt": TOKENIZER_TYPE.WPM, "repo": "https://huggingface.co/BAAI/bge-small-en-v1.5", },
+    {"name": "mpt", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/mosaicml/mpt-7b", },
+    {"name": "starcoder", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/bigcode/starcoder2-3b", },
+    {"name": "gpt-2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/openai-community/gpt2", },
+]

 # make directory "models/tokenizers" if it doesn't exist
 if not os.path.exists("models/tokenizers"):
     os.makedirs("models/tokenizers")


 def download_file_with_auth(url, token, save_path):
     headers = {"Authorization": f"Bearer {token}"}
     response = requests.get(url, headers=headers)
     if response.status_code == 200:
         with open(save_path, 'wb') as f:
             f.write(response.content)
-        print(f"File {save_path} downloaded successfully")
+        logger.info(f"File {save_path} downloaded successfully")
     else:
-        print(f"Failed to download file. Status code: {response.status_code}")
+        logger.info(f"Failed to download file. Status code: {response.status_code}")


 # download the tokenizer models
 for model in models:
@@ -81,10 +89,10 @@ def download_file_with_auth(url, token, save_path):
     if not os.path.exists(f"models/tokenizers/{name}"):
         os.makedirs(f"models/tokenizers/{name}")
     else:
-        print(f"Directory models/tokenizers/{name} already exists - skipping")
+        logger.info(f"Directory models/tokenizers/{name} already exists - skipping")
         continue

-    print(f"Downloading {name} to models/tokenizers/{name}")
+    logger.info(f"Downloading {name} to models/tokenizers/{name}")

     url = f"{repo}/raw/main/config.json"
     save_path = f"models/tokenizers/{name}/config.json"
@@ -115,76 +123,76 @@ def download_file_with_auth(url, token, save_path):
         continue

     # create the tokenizer
-    from transformers import AutoTokenizer
     tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")

     chktok = tokenizer.encode(chktxt)
     chkhsh = sha256(str(chktok).encode()).hexdigest()

-    print(f"model: {name}")
-    print(f"tokt: {tokt}")
-    print(f"repo: {model['repo']}")
-    print(f"chktok: {chktok}")
-    print(f"chkhsh: {chkhsh}")
+    logger.info(f"model: {name}")
+    logger.info(f"tokt: {tokt}")
+    logger.info(f"repo: {model['repo']}")
+    logger.info(f"chktok: {chktok}")
+    logger.info(f"chkhsh: {chkhsh}")

     # print the "pre_tokenizer" content from the tokenizer.json
     with open(f"models/tokenizers/{name}/tokenizer.json", "r", encoding="utf-8") as f:
         cfg = json.load(f)
         pre_tokenizer = cfg["pre_tokenizer"]
-        print("pre_tokenizer: " + json.dumps(pre_tokenizer, indent=4))
+        logger.info("pre_tokenizer: " + json.dumps(pre_tokenizer, indent=4))

-    print(f"\n")
+    logger.info("")

     src_ifs += f"        if chkhsh == \"{chkhsh}\":\n"
     src_ifs += f"            # ref: {model['repo']}\n"
     src_ifs += f"            res = \"{name}\"\n"

-src_func = ""
-src_func += "    def get_vocab_base_pre(self, tokenizer) -> str:\n"
-src_func += "        # encoding this string and hashing the resulting tokens would (hopefully) give us a unique identifier that\n"
-src_func += "        # is specific for the BPE pre-tokenizer used by the model\n"
-src_func += "        # we will use this unique identifier to write a \"tokenizer.ggml.pre\" entry in the GGUF file which we can\n"
-src_func += "        # use in llama.cpp to implement the same pre-tokenizer\n"
-src_func += "\n"
-src_func += f"        chktxt = {repr(chktxt)}\n"
-src_func += "\n"
-src_func += "        chktok = tokenizer.encode(chktxt)\n"
-src_func += "        chkhsh = sha256(str(chktok).encode()).hexdigest()\n"
-src_func += "\n"
-src_func += "        print(f\"chktok: {chktok}\")\n"
-src_func += "        print(f\"chkhsh: {chkhsh}\")\n"
-src_func += "\n"
-src_func += "        res = None\n"
-src_func += "\n"
-src_func += "        # NOTE: if you get an error here, you need to update the convert-hf-to-gguf-update.py script\n"
-src_func += "        # or pull the latest version of the model from Huggingface\n"
-src_func += "        # don't edit the hashes manually!\n"
-src_func += f"{src_ifs}\n"
-src_func += "        if res is None:\n"
-src_func += "            print(\"\\n\")\n"
-src_func += "            print(\"**************************************************************************************\")\n"
-src_func += "            print(\"** WARNING: The BPE pre-tokenizer was not recognized!\")\n"
-src_func += "            print(\"** There are 2 possible reasons for this:\")\n"
-src_func += "            print(\"** - the model has not been added to convert-hf-to-gguf-update.py yet\")\n"
-src_func += "            print(\"** - the pre-tokenization config has changed upstream\")\n"
-src_func += "            print(\"** Check your model files and convert-hf-to-gguf-update.py and update them accordingly.\")\n"
-src_func += "            print(\"** ref: https://github.com/ggerganov/llama.cpp/pull/6920\")\n"
-src_func += "            print(\"**\")\n"
-src_func += "            print(f\"** chkhsh: {chkhsh}\")\n"
-src_func += "            print(\"**************************************************************************************\")\n"
-src_func += "            print(\"\\n\")\n"
-src_func += "            raise NotImplementedError(\"BPE pre-tokenizer was not recognized - update get_vocab_base_pre()\")\n"
-src_func += "\n"
-src_func += "        print(f\"tokenizer.ggml.pre: {res}\")\n"
-src_func += "        print(f\"chkhsh: {chkhsh}\")\n"
-src_func += "\n"
-src_func += "        return res\n"
-
-print(src_func)
-
-print("\n")
-print("!!! Copy-paste the function above into convert-hf-to-gguf.py !!!")
-print("\n")
+src_func = f"""
+    def get_vocab_base_pre(self, tokenizer) -> str:
+        # encoding this string and hashing the resulting tokens would (hopefully) give us a unique identifier that
+        # is specific for the BPE pre-tokenizer used by the model
+        # we will use this unique identifier to write a "tokenizer.ggml.pre" entry in the GGUF file which we can
+        # use in llama.cpp to implement the same pre-tokenizer
+
+        chktxt = {repr(chktxt)}
+
+        chktok = tokenizer.encode(chktxt)
+        chkhsh = sha256(str(chktok).encode()).hexdigest()
+
+        print(f"chktok: {{chktok}}")
+        print(f"chkhsh: {{chkhsh}}")
+
+        res = None
+
+        # NOTE: if you get an error here, you need to update the convert-hf-to-gguf-update.py script
+        # or pull the latest version of the model from Huggingface
+        # don't edit the hashes manually!
+{src_ifs}
+        if res is None:
+            print("\\n")
+            print("**************************************************************************************")
+            print("** WARNING: The BPE pre-tokenizer was not recognized!")
+            print("** There are 2 possible reasons for this:")
+            print("** - the model has not been added to convert-hf-to-gguf-update.py yet")
+            print("** - the pre-tokenization config has changed upstream")
+            print("** Check your model files and convert-hf-to-gguf-update.py and update them accordingly.")
+            print("** ref: https://github.com/ggerganov/llama.cpp/pull/6920")
+            print("**")
+            print(f"** chkhsh: {{chkhsh}}")
+            print("**************************************************************************************")
+            print("\\n")
+            raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
+
+        print(f"tokenizer.ggml.pre: {{repr(res)}}")
+        print(f"chkhsh: {{chkhsh}}")
+
+        return res
+"""
+
+print(src_func)  # noqa: NP100
+
+logger.info("\n")
+logger.info("!!! Copy-paste the function above into convert-hf-to-gguf.py !!!")
+logger.info("\n")

 # generate tests for each tokenizer model

@@ -250,7 +258,6 @@ def download_file_with_auth(url, token, save_path):
     tokt = model["tokt"]

     # create the tokenizer
-    from transformers import AutoTokenizer
     tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")

     with open(f"models/ggml-vocab-{name}.gguf.inp", "w", encoding="utf-8") as f:
@@ -265,15 +272,15 @@ def download_file_with_auth(url, token, save_path):
             f.write(f" {r}")
         f.write("\n")

-    print(f"Tests for {name} written in ./models/ggml-vocab-{name}.gguf.*")
+    logger.info(f"Tests for {name} written in ./models/ggml-vocab-{name}.gguf.*")

 # generate commands for creating vocab files

-print("\nRun the following commands to generate the vocab files for testing:\n")
+logger.info("\nRun the following commands to generate the vocab files for testing:\n")

 for model in models:
     name = model["name"]

-    print(f"python3 convert-hf-to-gguf.py models/tokenizers/{name}/ --outfile models/ggml-vocab-{name}.gguf --vocab-only")
+    logger.info(f"python3 convert-hf-to-gguf.py models/tokenizers/{name}/ --outfile models/ggml-vocab-{name}.gguf --vocab-only")

-print("\n")
+logger.info("\n")
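
A related convention from the commit list ("*.py: Convert logger error and sys.exit() into a raise exception"): instead of logging an error and then calling sys.exit(), the scripts now raise, so callers can catch the failure and the interpreter still exits non-zero. A before/after sketch under that commit's description; the function name and message are hypothetical, not code from the PR:

import logging
import sys

logger = logging.getLogger("convert")


def check_outfile_old(path: str) -> None:
    # before: the error is logged, then the process is killed directly
    if not path.endswith(".gguf"):
        logger.error(f"unexpected output extension: {path}")
        sys.exit(1)


def check_outfile_new(path: str) -> None:
    # after: raising carries the message, produces a traceback,
    # and sets a non-zero exit status without sys.exit()
    if not path.endswith(".gguf"):
        raise ValueError(f"unexpected output extension: {path}")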