subword ## should be an option. #33
Comments
Furthermore, there is no such rule to force use |
In model, the substr symbol always be called as |
Hi, I was wondering about the subword rules too, in connection with #31. Does the code in #31 handle subwords? Also, unrelated, but a thought I had earlier: it would be nice to convert test_tokenizer.cpp to Python and run the tests against the reference tokenizers. |
@skeskinen No, #31 only makes the vocab optional (because it may be missing). This issue is a separate problem with subwords (I found it because I hit too many unknown tokens when using e5). Below are some token samples from BERT-based models. In m3e, the subword symbol is ##, as in many BERT models.
In e5, the subword symbol is ▁.
|
For now, I do not have a good idea for how to solve this issue, so I have not opened a PR for it. |
Keep it up! We need cross-platform Chinese and English embeddings. The multilingual version of E5 is quite good. |
For BERT, many models use `##` as the subword symbol, but not all. Some popular BERT-based models define their own subword symbol. For example, in e5 the symbol is `▁`.
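To make the request concrete, here is a minimal Python sketch of a greedy longest-match word splitter in which the subword symbol is a parameter rather than a hard-coded `##`. It is not the project's tokenizer: the function name `tokenize_word`, the `marks_continuation` flag, and both toy vocabularies are hypothetical. It also assumes that `##` marks continuation pieces (as in BERT WordPiece vocabularies) while `▁` marks the start of a word (as in SentencePiece-style vocabularies); e5's real tokenizer may differ in other ways as well.

```python
def tokenize_word(word, vocab, marker="##", marks_continuation=True, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocab pieces.

    marker: the subword symbol used by the vocab (hypothetical option).
    marks_continuation: True if the marker prefixes non-initial pieces
    (assumed "##" style), False if it prefixes the word-initial piece
    (assumed "▁" style).
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            is_first = start == 0
            if marks_continuation and not is_first:
                piece = marker + piece
            elif not marks_continuation and is_first:
                piece = marker + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matched at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabularies, invented for illustration only.
bert_style_vocab = {"play", "##ing", "##ed"}
e5_style_vocab = {"▁play", "ing", "ed"}

print(tokenize_word("playing", bert_style_vocab))
print(tokenize_word("playing", e5_style_vocab, marker="▁", marks_continuation=False))
# Using the "##" convention against the "▁" vocab finds no match.
print(tokenize_word("playing", e5_style_vocab))
```

The last call shows the failure mode reported in this thread: when the tokenizer assumes `##` but the vocab uses another symbol, ordinary words fall through to the unknown token.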