llama : improve BPE pre-processing + LLaMA 3 and Deepseek support #6920

Merged 61 commits from gg/bpe-preprocess into master on Apr 29, 2024.

Commits:
6fbab2d merged the changes from deepseeker models to main branch (jaggzh, Feb 12, 2024)
d2cfc22 Moved regex patterns to unicode.cpp and updated unicode.h (dragnil1, Mar 22, 2024)
54f93eb Moved header files (dragnil1, Mar 22, 2024)
1c924e4 Resolved issues (dragnil1, Mar 23, 2024)
4056dc5 added and refactored unicode_regex_split and related functions (dragnil1, Mar 31, 2024)
c8e7d95 Updated/merged the deepseek coder pr (jaggzh, Feb 12, 2024)
4c3e882 Refactored code (dragnil1, Apr 13, 2024)
a5710a4 Adding unicode regex mappings (dragnil1, Apr 15, 2024)
7e308ed Adding unicode regex function (dragnil1, Apr 15, 2024)
feeaf4f Added needed functionality, testing remains (dragnil1, Apr 15, 2024)
7535803 Fixed issues (dragnil1, Apr 15, 2024)
36d9832 Fixed issue with gpt2 regex custom preprocessor (dragnil1, Apr 17, 2024)
06d3e69 unicode : fix? unicode_wstring_to_utf8 (ggerganov, Apr 26, 2024)
c56e19d lint : fix whitespaces (ggerganov, Apr 26, 2024)
7a44e44 tests : add tokenizer tests for numbers (ggerganov, Apr 26, 2024)
d999cf6 unicode : remove redundant headers (ggerganov, Apr 26, 2024)
aeafb43 tests : remove and rename tokenizer test scripts (ggerganov, Apr 26, 2024)
e1b2bf7 tests : add sample usage (ggerganov, Apr 26, 2024)
ed42711 gguf-py : reader prints warnings on duplicate keys (ggerganov, Apr 26, 2024)
4907e41 llama : towards llama3 tokenization support (wip) (ggerganov, Apr 26, 2024)
e8c206b unicode : shot in the dark to fix tests on Windows (ggerganov, Apr 26, 2024)
e989176 unicode : first try custom implementations (ggerganov, Apr 26, 2024)
e3f6dc7 Merge branch 'master' into gg/bpe-preprocess (ggerganov, Apr 26, 2024)
9b4d63a convert : add "tokenizer.ggml.pre" GGUF KV (wip) (ggerganov, Apr 26, 2024)
43e12ce llama : use new pre-tokenizer type (ggerganov, Apr 26, 2024)
1b9b79d convert : fix pre-tokenizer type writing (ggerganov, Apr 26, 2024)
8791e94 lint : fix (ggerganov, Apr 26, 2024)
a774d70 make : add test-tokenizer-0-llama-v3 (ggerganov, Apr 26, 2024)
c160818 wip (ggerganov, Apr 26, 2024)
96965f6 models : add llama v3 vocab file (ggerganov, Apr 27, 2024)
ad92983 llama : adapt punctuation regex + add llama 3 regex (ggerganov, Apr 27, 2024)
4434c9d minor (ggerganov, Apr 27, 2024)
a22645c unicode : set bomb (ggerganov, Apr 27, 2024)
2affd0b unicode : set bomb (ggerganov, Apr 27, 2024)
ce5485a unicode : always use std::wregex (ggerganov, Apr 27, 2024)
91eaa41 unicode : support \p{N}, \p{L} and \p{P} natively (ggerganov, Apr 27, 2024)
581c4a0 unicode : try fix windows (ggerganov, Apr 27, 2024)
b97add5 unicode : category support via std::regex (ggerganov, Apr 28, 2024)
d63cc90 Merge branch 'master' into gg/bpe-preprocess (ggerganov, Apr 28, 2024)
e972e6c unicode : clean-up (ggerganov, Apr 28, 2024)
ee6d1b3 unicode : simplify (ggerganov, Apr 28, 2024)
7642973 convert : add convert-hf-to-gguf-update.py (ggerganov, Apr 28, 2024)
4e3e6d8 lint : update (ggerganov, Apr 28, 2024)
1c888eb convert : add falcon (ggerganov, Apr 28, 2024)
1545550 unicode : normalize signatures (ggerganov, Apr 28, 2024)
491f233 lint : fix (ggerganov, Apr 28, 2024)
e8dd4a1 lint : fix (ggerganov, Apr 28, 2024)
02fd977 convert : remove unused functions (ggerganov, Apr 28, 2024)
0f9058c convert : add comments (ggerganov, Apr 28, 2024)
7808150 convert : exercise contractions (ggerganov, Apr 28, 2024)
7b1210f lint : fix (ggerganov, Apr 28, 2024)
ef4cca9 cmake : refactor test targets (ggerganov, Apr 29, 2024)
43708d2 tests : refactor vocab tests (ggerganov, Apr 29, 2024)
c68d259 tests : add more vocabs and tests (ggerganov, Apr 29, 2024)
af05268 unicode : cleanup (ggerganov, Apr 29, 2024)
c21ab18 scripts : ignore new update script in check-requirements.sh (ggerganov, Apr 29, 2024)
120cf37 models : add phi-3, mpt, gpt-2, starcoder (ggerganov, Apr 29, 2024)
9a7d430 tests : disable obsolete (ggerganov, Apr 29, 2024)
6d6ce93 tests : use faster bpe test (ggerganov, Apr 29, 2024)
3202676 llama : more prominent warning for old BPE models (ggerganov, Apr 29, 2024)
80cb312 tests : disable test-tokenizer-1-bpe due to slowness (ggerganov, Apr 29, 2024)

Files changed:

2 changes: 1 addition & 1 deletion .github/workflows/python-lint.yml
@@ -21,4 +21,4 @@ jobs:
         uses: py-actions/flake8@v2
         with:
           ignore: "E203,E211,E221,E225,E231,E241,E251,E261,E266,E501,E701,E704,W503"
-          exclude: "examples/*,examples/*/**,*/**/__init__.py"
+          exclude: "examples/*,examples/*/**,*/**/__init__.py,convert-hf-to-gguf-update.py"
17 changes: 17 additions & 0 deletions .gitignore
@@ -108,3 +108,20 @@ examples/server/*.mjs.hpp
 poetry.lock
 poetry.toml
 nppBackup
+
+# Test binaries
+/tests/test-grammar-parser
+/tests/test-llama-grammar
+/tests/test-double-float
+/tests/test-grad0
+/tests/test-opt
+/tests/test-quantize-fns
+/tests/test-quantize-perf
+/tests/test-sampling
+/tests/test-tokenizer-0-llama
+/tests/test-tokenizer-0-falcon
+/tests/test-tokenizer-0-deepseek-coder
+/tests/test-tokenizer-1-llama
+/tests/test-tokenizer-1-bpe
+/tests/test-rope
+/tests/test-backend-ops
42 changes: 37 additions & 5 deletions Makefile
@@ -6,11 +6,27 @@ BUILD_TARGETS = \
 
 # Binaries only useful for tests
 TEST_TARGETS = \
-	tests/test-llama-grammar tests/test-grammar-parser tests/test-double-float tests/test-grad0 tests/test-opt \
-	tests/test-quantize-fns tests/test-quantize-perf tests/test-sampling tests/test-tokenizer-0-llama \
-	tests/test-tokenizer-0-falcon tests/test-tokenizer-1-llama tests/test-tokenizer-1-bpe tests/test-rope \
-	tests/test-backend-ops tests/test-model-load-cancel tests/test-autorelease \
-	tests/test-json-schema-to-grammar tests/test-grammar-integration
+	tests/test-autorelease \
+	tests/test-backend-ops \
+	tests/test-double-float \
+	tests/test-grad0 \
+	tests/test-grammar-integration \
+	tests/test-grammar-parser \
+	tests/test-json-schema-to-grammar \
+	tests/test-llama-grammar \
+	tests/test-model-load-cancel \
+	tests/test-opt \
+	tests/test-quantize-fns \
+	tests/test-quantize-perf \
+	tests/test-rope \
+	tests/test-sampling \
+	tests/test-tokenizer-0-deepseek-coder \
+	tests/test-tokenizer-0-deepseek-llm \
+	tests/test-tokenizer-0-falcon \
+	tests/test-tokenizer-0-llama \
+	tests/test-tokenizer-0-llama-v3 \
+	tests/test-tokenizer-1-bpe \
+	tests/test-tokenizer-1-llama
 
 # Code coverage output files
 COV_TARGETS = *.gcno tests/*.gcno *.gcda tests/*.gcda *.gcov tests/*.gcov lcov-report gcovr-report
@@ -51,8 +67,14 @@ test: $(TEST_TARGETS)
 	for test_target in $(TEST_TARGETS); do \
 		if [ "$$test_target" = "tests/test-tokenizer-0-llama" ]; then \
 			./$$test_target $(CURDIR)/models/ggml-vocab-llama.gguf; \
+		elif [ "$$test_target" = "tests/test-tokenizer-0-llama-v3" ]; then \
+			./$$test_target $(CURDIR)/models/ggml-vocab-llama-v3.gguf; \
 		elif [ "$$test_target" = "tests/test-tokenizer-0-falcon" ]; then \
 			./$$test_target $(CURDIR)/models/ggml-vocab-falcon.gguf; \
+		elif [ "$$test_target" = "tests/test-tokenizer-0-deepseek-coder" ]; then \
+			./$$test_target $(CURDIR)/models/ggml-vocab-deepseek-coder.gguf; \
+		elif [ "$$test_target" = "tests/test-tokenizer-0-deepseek-llm" ]; then \
+			./$$test_target $(CURDIR)/models/ggml-vocab-deepseek-llm.gguf; \
 		elif [ "$$test_target" = "tests/test-tokenizer-1-llama" ]; then \
 			continue; \
 		elif [ "$$test_target" = "tests/test-tokenizer-1-bpe" ]; then \
@@ -979,6 +1001,16 @@ tests/test-tokenizer-0-llama: tests/test-tokenizer-0-llama.cpp ggml.o llama.o $(
 	$(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<)
 	$(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS)
 
+tests/test-tokenizer-0-llama-v3: tests/test-tokenizer-0-llama-v3.cpp ggml.o llama.o $(COMMON_DEPS) console.o $(OBJS)
+	$(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<)
+	$(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS)
+
+tests/test-tokenizer-0-deepseek-coder: tests/test-tokenizer-0-deepseek-coder.cpp ggml.o llama.o $(COMMON_DEPS) $(OBJS)
+	$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)
+
+tests/test-tokenizer-0-deepseek-llm: tests/test-tokenizer-0-deepseek-llm.cpp ggml.o llama.o $(COMMON_DEPS) $(OBJS)
+	$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)
+
 tests/test-tokenizer-1-bpe: tests/test-tokenizer-1-bpe.cpp ggml.o llama.o $(COMMON_DEPS) console.o $(OBJS)
 	$(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<)
 	$(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS)
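
For reference, the new llama-v3 test binary is built with "make tests/test-tokenizer-0-llama-v3" and invoked as "./tests/test-tokenizer-0-llama-v3 models/ggml-vocab-llama-v3.gguf", mirroring the test loop above; the deepseek targets follow the same pattern with their respective vocab files.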
12 changes: 12 additions & 0 deletions common/common.cpp
@@ -1693,6 +1693,18 @@ std::vector<std::string> string_split(std::string input, char separator) {
     return parts;
 }
 
+std::string string_strip(const std::string & str) {
+    size_t start = 0;
+    size_t end = str.size();
+    while (start < end && std::isspace(str[start])) {
+        start++;
+    }
+    while (end > start && std::isspace(str[end - 1])) {
+        end--;
+    }
+    return str.substr(start, end - start);
+}
+
 std::vector<llama_sampler_type> sampler_types_from_names(const std::vector<std::string> & names, bool allow_alt_names) {
     std::unordered_map<std::string, llama_sampler_type> sampler_canonical_name_map {
         {"top_k", llama_sampler_type::TOP_K},
1 change: 1 addition & 0 deletions common/common.h
@@ -196,6 +196,7 @@ bool validate_file_name(const std::string & filename);
 std::vector<llama_sampler_type> sampler_types_from_names(const std::vector<std::string> & names, bool allow_alt_names);
 std::vector<llama_sampler_type> sampler_types_from_chars(const std::string & names_string);
 std::vector<std::string> string_split(std::string input, char separator);
+std::string string_strip(const std::string & str);
 std::string sampler_type_to_name_string(llama_sampler_type sampler_type);
 
 //
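
A quick usage sketch of the new string_strip() helper (a hypothetical call site for illustration, not part of this diff; assumes the translation unit includes common.h and links against common):

    #include <cassert>
    #include "common.h"

    void example() {
        // trims characters classified by std::isspace from both ends
        assert(string_strip("  tokenizer.ggml.pre \n") == "tokenizer.ggml.pre");
    }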
175 changes: 175 additions & 0 deletions convert-hf-to-gguf-update.py
@@ -0,0 +1,175 @@
# This script downloads the tokenizer models of the specified models from Huggingface and
# generates the get_vocab_base_pre() function for convert-hf-to-gguf.py
#
# This is necessary in order to analyze the type of pre-tokenizer used by the model and
# provide the necessary information to llama.cpp via the GGUF header in order to implement
# the same pre-tokenizer.
#
# ref: https://github.com/ggerganov/llama.cpp/pull/6920
#
# Instructions:
#
# - Add a new model to the "models" list
# - Run the script with your huggingface token:
#
# python3 convert-hf-to-gguf-update.py <huggingface_token>
#
# - Copy-paste the generated get_vocab_base_pre() function into convert-hf-to-gguf.py
# - Update llama.cpp with the new pre-tokenizer if necessary
#
# TODO: generate tokenizer tests for llama.cpp
# TODO: automate the update of convert-hf-to-gguf.py
#

import os
import requests
import sys
import json

from hashlib import sha256
from enum import IntEnum, auto

class TOKENIZER_TYPE(IntEnum):
    SPM = auto()
    BPE = auto()
    WPM = auto()

# TODO: this string has to exercise as much pre-tokenizer functionality as possible
# will be updated with time - contributions welcome
chktxt = '\n \n\n \n\n\n \t \t\t \t\n \n \n \n \n🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天~ ------======= нещо на Български \'\'\'\'\'\'```````\"\"\"\"......!!!!!!?????? I\'ve been \'told he\'s there, \'RE you sure? \'M not sure I\'ll make it, \'D you like some tea? We\'Ve a\'lL'

if len(sys.argv) == 2:
    token = sys.argv[1]
else:
    print("Usage: python convert-hf-to-gguf-update.py <huggingface_token>")
    sys.exit(1)

# TODO: add models here, base models preferred
models = [
    { "name": "llama-v2",       "tokt": TOKENIZER_TYPE.SPM, "repo": "https://huggingface.co/meta-llama/Llama-2-7b-hf", },
    { "name": "llama-v3",       "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/meta-llama/Meta-Llama-3-8B", },
    { "name": "deepseek-llm",   "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/deepseek-ai/deepseek-llm-7b-base", },
    { "name": "deepseek-coder", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base", },
    { "name": "falcon",         "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/tiiuae/falcon-7b", },
    { "name": "bert-bge",       "tokt": TOKENIZER_TYPE.WPM, "repo": "https://huggingface.co/BAAI/bge-small-en-v1.5", },
]

# make directory "models/tokenizers" if it doesn't exist
if not os.path.exists("models/tokenizers"):
    os.makedirs("models/tokenizers")

def download_file_with_auth(url, token, save_path):
    headers = {"Authorization": f"Bearer {token}"}
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        with open(save_path, 'wb') as f:
            f.write(response.content)
        print("File downloaded successfully.")
    else:
        print(f"Failed to download file. Status code: {response.status_code}")

for model in models:
    name = model["name"]
    repo = model["repo"]
    tokt = model["tokt"]

    if not os.path.exists(f"models/tokenizers/{name}"):
        os.makedirs(f"models/tokenizers/{name}")
    else:
        print(f"Directory models/tokenizers/{name} already exists - skipping")
        continue

    print(f"Downloading {name} to models/tokenizers/{name}")

    url = f"{repo}/raw/main/tokenizer.json"
    save_path = f"models/tokenizers/{name}/tokenizer.json"
    download_file_with_auth(url, token, save_path)

    if tokt == TOKENIZER_TYPE.SPM:
        url = f"{repo}/resolve/main/tokenizer.model"
        save_path = f"models/tokenizers/{name}/tokenizer.model"
        download_file_with_auth(url, token, save_path)

    url = f"{repo}/raw/main/tokenizer_config.json"
    save_path = f"models/tokenizers/{name}/tokenizer_config.json"
    download_file_with_auth(url, token, save_path)

# generate the source code for the convert-hf-to-gguf.py:get_vocab_base_pre() function:
# TODO: auto-update convert-hf-to-gguf.py with the generated function

src_ifs = ""
for model in models:
    name = model["name"]
    tokt = model["tokt"]

    if tokt == TOKENIZER_TYPE.SPM:
        continue

    # create the tokenizer
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")

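    # hashing the token IDs of chktxt acts as a fingerprint of the pre-tokenizer's
    # behavior - two tokenizers that pre-process text identically produce the same digest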
    chktok = tokenizer.encode(chktxt)
    chkhsh = sha256(str(chktok).encode()).hexdigest()

    print(f"model: {name}")
    print(f"tokt: {tokt}")
    print(f"repo: {model['repo']}")
    print(f"chktok: {chktok}")
    print(f"chkhsh: {chkhsh}")

    # print the "pre_tokenizer" content from the tokenizer.json
    with open(f"models/tokenizers/{name}/tokenizer.json", "r") as f:
        cfg = json.load(f)
        pre_tokenizer = cfg["pre_tokenizer"]
        print("pre_tokenizer: " + json.dumps(pre_tokenizer, indent=4))

    print(f"\n")

    src_ifs += f"        if chkhsh == \"{chkhsh}\":\n"
    src_ifs += f"            # ref: {model['repo']}\n"
    src_ifs += f"            res = \"{name}\"\n"

src_func = ""
src_func += "    def get_vocab_base_pre(self, tokenizer) -> str:\n"
src_func += "        # encoding this string and hashing the resulting tokens would (hopefully) give us a unique identifier that\n"
src_func += "        # is specific for the BPE pre-tokenizer used by the model\n"
src_func += "        # we will use this unique identifier to write a \"tokenizer.ggml.pre\" entry in the GGUF file which we can\n"
src_func += "        # use in llama.cpp to implement the same pre-tokenizer\n"
src_func += "\n"
src_func += f"        chktxt = {repr(chktxt)}\n"
src_func += "\n"
src_func += "        chktok = tokenizer.encode(chktxt)\n"
src_func += "        chkhsh = sha256(str(chktok).encode()).hexdigest()\n"
src_func += "\n"
src_func += "        print(f\"chktok: {chktok}\")\n"
src_func += "        print(f\"chkhsh: {chkhsh}\")\n"
src_func += "\n"
src_func += "        res = None\n"
src_func += "\n"
src_func += "        # NOTE: if you get an error here, you need to add the model to the if-elif chain below\n"
src_func += "        #       don't do this manually - use the convert-hf-to-gguf-update.py script!\n"
src_func += f"{src_ifs}\n"
src_func += "        if res is None:\n"
src_func += "            print( \"\\n\")\n"
src_func += "            print( \"**************************************************************************************\")\n"
src_func += "            print( \"** WARNING: The BPE pre-tokenizer was not recognized!\")\n"
src_func += "            print( \"** This means that it was not added yet or you are using an older version.\")\n"
src_func += "            print( \"** Check convert-hf-to-gguf-update.py and update it accordingly.\")\n"
src_func += "            print( \"**\")\n"
src_func += "            print(f\"** chkhsh: {chkhsh}\")\n"
src_func += "            print( \"**************************************************************************************\")\n"
src_func += "            print( \"\\n\")\n"
src_func += "            raise NotImplementedError(\"BPE pre-tokenizer was not recognized - update get_vocab_base_pre()\")\n"
src_func += "\n"
src_func += "        print(f\"tokenizer.ggml.pre: {res}\")\n"
src_func += "        print(f\"chkhsh: {chkhsh}\")\n"
src_func += "\n"
src_func += "        return res\n"

print(src_func)

print("\n")
print("!!! Copy-paste the function above into convert-hf-to-gguf.py !!!")
print("\n")

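For context, a minimal sketch of how the "tokenizer.ggml.pre" value written by the converter can later be read back from a GGUF file. This is an illustration only, not code from this PR; it assumes the gguf API exposed via ggml.h, and "model.gguf" is a placeholder path:

    #include <stdio.h>
    #include "ggml.h"

    int main(void) {
        // read only the metadata - no tensor data is allocated
        struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ NULL };
        struct gguf_context * ctx = gguf_init_from_file("model.gguf", params);
        if (!ctx) {
            return 1;
        }
        const int kid = gguf_find_key(ctx, "tokenizer.ggml.pre");
        if (kid >= 0) {
            // the pre-tokenizer identifier, e.g. "llama-v3", "deepseek-coder", "falcon"
            printf("tokenizer.ggml.pre: %s\n", gguf_get_val_str(ctx, kid));
        }
        gguf_free(ctx);
        return 0;
    }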