feat: add changes to handle jina v2 chinese code #7795

Open · JoanFM wants to merge 57 commits into master from feat-jina-embeddings-v2-zh

Commits (57)
86a5d96
feat: first things to do
JoanFM Apr 11, 2024
747d17a
feat: create tensors for Jina architecture
JoanFM Apr 12, 2024
a40156a
fix: use other tensors
JoanFM Apr 12, 2024
b00d38b
feat: embedding gets results
JoanFM Apr 16, 2024
cf1c144
fix: fix usage of ALIBI
JoanFM Apr 22, 2024
63a1d7c
fix: clean prints
JoanFM Apr 22, 2024
c229e48
fix: do some cleanup unused vars
JoanFM Apr 22, 2024
e232370
fix: revert changes to Makefile and CMakeLists
JoanFM Apr 22, 2024
795ff1d
fix: revert some changes
JoanFM Apr 22, 2024
d6ac931
fix: fix small detail
JoanFM Apr 22, 2024
db7e8ce
Merge branch 'master' into feat-jina-embeddings
JoanFM Apr 22, 2024
c1c0f4d
fix: fix convert formatting
JoanFM Apr 22, 2024
64cd4b1
fix: fix linting and editor
JoanFM Apr 22, 2024
71ff763
feat: set proper vocab settings
JoanFM Apr 22, 2024
d7d6a4e
fix: JinaBertForMaskedLM registration
JoanFM Apr 23, 2024
cde49b7
feat: support q_normalization and k_normalization in Jina arch
JoanFM Apr 23, 2024
dd060a2
feat: handle gpt2 tokenizer with Jina architecture
JoanFM Apr 24, 2024
dfa0676
feat: example comments in embedding
JoanFM Apr 24, 2024
c3f4b1f
feat: rename Jina Bert to Jina Bert V2
JoanFM Apr 24, 2024
603f18b
feat: small changes to allow jina embeddings ZH model
JoanFM Apr 29, 2024
f8d1709
Merge branch 'master' into feat-jina-embeddings
JoanFM Apr 30, 2024
da96368
fix: add some changes as per review
JoanFM Apr 30, 2024
2835441
Merge branch 'feat-jina-embeddings' of https://github.com/JoanFM/llam…
JoanFM Apr 30, 2024
d9b8dd6
fix: add some changes as per review
JoanFM Apr 30, 2024
e73ab4b
Merge branch 'feat-jina-embeddings' of https://github.com/JoanFM/llam…
JoanFM Apr 30, 2024
14073a2
feat: proper KQ_pos for Jina embeddings
JoanFM Apr 30, 2024
f6365b8
Merge branch 'feat-jina-embeddings' of https://github.com/JoanFM/llam…
JoanFM May 2, 2024
14cd69a
feat: add pre tokenization
JoanFM May 2, 2024
d5c3525
feat: first iteration NFC
JoanFM May 6, 2024
76436c1
Merge branch 'master' of https://github.com/JoanFM/llama.cpp into fea…
JoanFM May 6, 2024
365af24
Merge branch 'feat-jina-embeddings' of https://github.com/JoanFM/llam…
JoanFM May 6, 2024
3269efe
Merge branch 'master' of https://github.com/JoanFM/llama.cpp into fea…
JoanFM May 11, 2024
d0a99aa
Merge branch 'master' of https://github.com/JoanFM/llama.cpp into fea…
JoanFM May 13, 2024
8957cac
refactor: rename jina tokenizers to v2
JoanFM May 13, 2024
0771b17
Merge branch 'refactor-jina-rename' of https://github.com/JoanFM/llam…
JoanFM May 13, 2024
22a0113
fix: fix alignment
JoanFM May 13, 2024
fb83012
refactor: keep refactoring non-breaking
JoanFM May 13, 2024
ea0f7df
Merge branch 'refactor-jina-rename' of https://github.com/JoanFM/llam…
JoanFM May 13, 2024
22b5f6b
Merge branch 'master' of https://github.com/JoanFM/llama.cpp into fea…
JoanFM May 13, 2024
cc0ac09
feat: add changes to handle jina v2 base code
JoanFM May 28, 2024
21936dd
fix: do not complicate things
JoanFM May 28, 2024
9a65c7a
fix: fix the usage of the code model
JoanFM May 31, 2024
96a6f55
Merge branch 'master' of https://github.com/JoanFM/llama.cpp into fea…
JoanFM May 31, 2024
0fc775e
Merge branch 'master' of https://github.com/JoanFM/llama.cpp into fea…
JoanFM Jun 4, 2024
4bce30c
fix: fix comments
JoanFM Jun 4, 2024
3b44f8f
fix: fix linting issues
JoanFM Jun 5, 2024
05659d3
fix: remove ollama patches
JoanFM Jun 5, 2024
7ab6023
Merge branch 'master' of https://github.com/JoanFM/llama.cpp into fea…
JoanFM Jun 5, 2024
d86efa6
fix: merge with code
JoanFM Jun 5, 2024
a8a64fd
fix: fix preprocessing jina v2 zh
JoanFM Jun 6, 2024
605a619
fix: merge issues
JoanFM Jun 6, 2024
728e1b4
fix: lowercase unicode pt by unicode pt
JoanFM Jun 7, 2024
841b9a5
Merge branch 'master' into feat-jina-embeddings-v2-zh
JoanFM Jun 18, 2024
175391d
merge with master
JoanFM Jul 8, 2024
0699a4c
Merge branch 'feat-jina-embeddings-v2-zh' of https://github.com/JoanF…
JoanFM Jul 8, 2024
afd76e6
fix: handle default
JoanFM Jul 8, 2024
201559d
Merge branch 'master' of https://github.com/JoanFM/llama.cpp into fea…
JoanFM Jul 26, 2024
1 change: 1 addition & 0 deletions convert-hf-to-gguf-update.py
@@ -84,6 +84,7 @@ class TOKENIZER_TYPE(IntEnum):
{"name": "jina-v2-de", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/jinaai/jina-embeddings-v2-base-de", },
{"name": "smaug-bpe", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/abacusai/Smaug-Llama-3-70B-Instruct", },
{"name": "jina-v2-code", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/jinaai/jina-embeddings-v2-base-code", },
{"name": "jina-v2-zh", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/jinaai/jina-embeddings-v2-base-zh", },
]


3 changes: 3 additions & 0 deletions convert-hf-to-gguf.py
@@ -478,6 +478,9 @@ def get_vocab_base_pre(self, tokenizer) -> str:
if chkhsh == "7967bfa498ade6b757b064f31e964dddbb80f8f9a4d68d4ba7998fcf281c531a":
# ref: https://huggingface.co/jinaai/jina-embeddings-v2-base-code
res = "jina-v2-code"
if chkhsh == "c7699093ba4255a91e702aa38a596aa81669f3525dae06c2953267dde580f448":
# ref: https://huggingface.co/jinaai/jina-embeddings-v2-base-zh
res = "jina-v2-zh"

if res is None:
logger.warning("\n")
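Note: the digests compared above are produced by convert-hf-to-gguf-update.py, which encodes a fixed probe string with each tokenizer in its models table and hashes the resulting token IDs. A minimal sketch of that scheme, assuming the SHA-256-over-token-IDs approach the update script uses (the short probe text here is a stand-in for the script's much longer chktxt):

    from hashlib import sha256
    from transformers import AutoTokenizer

    # Stand-in probe text; the real chktxt exercises many more edge cases.
    chktxt = "Hello World \n\n 3.14 你好"

    # May need trust_remote_code=True depending on the repo.
    tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v2-base-zh")
    chktok = tokenizer.encode(chktxt)

    # SHA-256 over the stringified token-ID list yields the chkhsh that
    # get_vocab_base_pre() compares against.
    chkhsh = sha256(str(chktok).encode()).hexdigest()
    print(chkhsh)

If the printed digest is not in the if-chain, conversion falls through to the "unknown pre-tokenizer" warning, which is why each new tokenizer needs an entry like the jina-v2-zh one added here.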
39 changes: 32 additions & 7 deletions llama.cpp
@@ -4684,8 +4684,8 @@ static void llm_load_vocab(
tokenizer_pre == "jina-v2-de" ||
tokenizer_pre == "jina-v2-code") {
vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_GPT2;
} else if (
tokenizer_pre == "refact") {

} else if (tokenizer_pre == "refact") {
vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_REFACT;
} else if (
tokenizer_pre == "command-r") {
@@ -4705,6 +4705,9 @@ static void llm_load_vocab(
} else if (
tokenizer_pre == "smaug-bpe") {
vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_SMAUG;
} else if (
tokenizer_pre == "jina-v2-zh") {
vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_JINA_V2_ZH;
} else {
throw std::runtime_error(format("unknown pre-tokenizer type: '%s'", tokenizer_pre.c_str()));
}
@@ -4736,8 +4739,7 @@ static void llm_load_vocab(

for (uint32_t i = 0; i < n_vocab; i++) {
std::string word = gguf_get_arr_str(ctx, token_idx, i);
GGML_ASSERT(unicode_cpts_from_utf8(word).size() > 0);

//GGML_ASSERT(unicode_cpts_from_utf8(word).size() > 0); // check disabled: some vocabs (e.g. jinaai-embeddings-v2-base-zh) mistakenly contain a NULL entry; tolerable unless it occurs more than once
vocab.token_to_id[word] = i;

auto & token_data = vocab.id_to_token[i];
@@ -4771,9 +4773,18 @@ static void llm_load_vocab(
} else if (vocab.type == LLAMA_VOCAB_TYPE_WPM) {
vocab.linefeed_id = vocab.special_pad_id;
} else {
const std::vector<int> ids = llama_tokenize_internal(vocab, "\xC4\x8A", false); // U+010A
GGML_ASSERT(!ids.empty() && "model vocab missing newline token");
vocab.linefeed_id = ids[0];
try {
    const std::vector<int> ids = llama_tokenize_internal(vocab, "\xC4\x8A", false); // U+010A
    if (ids.empty()) {
        LLAMA_LOG_WARN("%s: %s vocabulary, but newline token not found: %s! Using special_pad_id instead.", __func__, llama_model_vocab_type_name(vocab.type), "\xC4\x8A");
        vocab.linefeed_id = vocab.special_pad_id;
    } else {
        vocab.linefeed_id = ids[0];
    }
} catch (const std::exception & e) {
    LLAMA_LOG_WARN("%s: %s vocabulary, but newline token not found: %s! Using special_pad_id instead.", __func__, llama_model_vocab_type_name(vocab.type), e.what());
    vocab.linefeed_id = vocab.special_pad_id;
}
}

// special tokens
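Background on the "\xC4\x8A" probe above: those two bytes are the UTF-8 encoding of U+010A ('Ċ'), the codepoint that GPT-2-style byte-level BPE maps the newline byte 0x0A to, so tokenizing it recovers the vocab's linefeed token. A short Python sketch of the standard GPT-2 byte-to-unicode map:

    def bytes_to_unicode() -> dict[int, str]:
        # GPT-2's reversible byte-to-codepoint map: visible Latin-1 bytes keep
        # their value; the remaining 68 bytes are shifted up to 256, 257, ...
        bs = (list(range(ord("!"), ord("~") + 1))
              + list(range(ord("¡"), ord("¬") + 1))
              + list(range(ord("®"), ord("ÿ") + 1)))
        cs = bs[:]
        n = 0
        for b in range(256):
            if b not in bs:
                bs.append(b)
                cs.append(256 + n)
                n += 1
        return dict(zip(bs, map(chr, cs)))

    mapping = bytes_to_unicode()
    print(mapping[0x0A], hex(ord(mapping[0x0A])), mapping[0x0A].encode("utf-8"))
    # prints: Ċ 0x10a b'\xc4\x8a'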
@@ -13025,6 +13036,20 @@ struct llm_tokenizer_bpe {
"(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
});
break;
case LLAMA_VOCAB_PRE_TYPE_JINA_V2_ZH:
    //TODO: Apply lowercase + whitespace pretokenization
    {
        std::string lowercase_text = text;
        std::transform(lowercase_text.begin(), lowercase_text.end(), lowercase_text.begin(), [](unsigned char c){ return std::tolower(c); });

        std::regex regexPattern("\\w+|[^\\w\\s]+");
        std::sregex_token_iterator it(lowercase_text.begin(), lowercase_text.end(), regexPattern);
        std::sregex_token_iterator end;

        while (it != end) {
            word_collection.push_back(*it++);
        }
    }
    break;
default:
    // default regex for BPE tokenization pre-processing
    word_collection = unicode_regex_split(text, {
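For intuition, a rough Python approximation of the new LLAMA_VOCAB_PRE_TYPE_JINA_V2_ZH branch (a sketch, not code from this PR): the re.ASCII flag mimics std::regex, whose \w and \s classes cover only ASCII, so CJK text falls through to the [^\w\s]+ alternative. One divergence worth noting: str.lower() here is Unicode-aware, while the C++ std::tolower loop lowercases ASCII bytes only.

    import re

    def jina_v2_zh_pretokenize(text: str) -> list[str]:
        # Lowercase, then split into word runs and punctuation/CJK runs;
        # whitespace between matches is dropped, as in the C++ iterator loop.
        return re.findall(r"\w+|[^\w\s]+", text.lower(), flags=re.ASCII)

    print(jina_v2_zh_pretokenize("Hello 世界! GPT-2"))
    # ['hello', '世界!', 'gpt', '-', '2']

Because runs of CJK characters and adjacent punctuation stay glued together ('世界!'), the subsequent BPE merges operate on those whole runs, which is the behavior the TODO above presumably intends to refine.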
1 change: 1 addition & 0 deletions llama.h
@@ -86,6 +86,7 @@ extern "C" {
LLAMA_VOCAB_PRE_TYPE_OLMO = 12,
LLAMA_VOCAB_PRE_TYPE_DBRX = 13,
LLAMA_VOCAB_PRE_TYPE_SMAUG = 14,
LLAMA_VOCAB_PRE_TYPE_JINA_V2_ZH = 15,
};

// note: these values should be synchronized with ggml_rope