Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tokenizer in bert.cpp is not good enough, how about tokenizers-cpp #36

Open
FFengIll opened this issue Sep 20, 2023 · 11 comments
Open

Tokenizer in bert.cpp is not good enough, how about tokenizers-cpp #36

FFengIll opened this issue Sep 20, 2023 · 11 comments

Comments

@FFengIll
Copy link

As mention in title, https://github.com/mlc-ai/tokenizers-cpp is a good implement for token.
Maybe persons do not like another dependency, but it is worthy.

@FFengIll
Copy link
Author

Using a mature implementation is helpful

  • case / uncase may both work.
  • CJK will work.
  • it is effecient.

@FFengIll
Copy link
Author

FFengIll commented Sep 20, 2023

I believe it works (demo as bellow).

image

But it won't be done right away.

@redthing1
Copy link

Great!

@FFengIll
Copy link
Author

FFengIll commented Sep 21, 2023

After review and testing, I found current tokenize implement is not efficient enough.
Import tokenizers-cpp boost much. (just for convenience)

Bellow is the benchmark on my laptop (via m2 max).

image

@cgisky1980
Copy link

I believe it works (demo as bellow).

image

But it won't be done right away.

E5 or m3e? that's great!

@skeskinen
Copy link
Owner

This is a very exiting direction, and huge props to FFenglll for getting this working.
Unfortunately I don't have much time to work on this project right now, but I think people would enjoy the changes you've made recently.

The usecase I originally had for this project is no longer valid, so I'm not as invested in making this library "production quality".

So I have 2 suggestions on how to share your changes:

  1. Make a new repo with the updates and I'll add a link to it at the beginning of the readme, saying that version is more up to date
  2. I can give you contributor rights to this repo and the huggingface storage for the pre-converted models. I can still do code reviews and answer any questions, but you'd be free to set the direction for the project to your liking (e.g. changing the tokenizer)

@FFengIll
Copy link
Author

FFengIll commented Sep 26, 2023

This is a very exiting direction, and huge props to FFenglll for getting this working. Unfortunately I don't have much time to work on this project right now, but I think people would enjoy the changes you've made recently.

The usecase I originally had for this project is no longer valid, so I'm not as invested in making this library "production quality".

So I have 2 suggestions on how to share your changes:

  1. Make a new repo with the updates and I'll add a link to it at the beginning of the readme, saying that version is more up to date
  2. I can give you contributor rights to this repo and the huggingface storage for the pre-converted models. I can still do code reviews and answer any questions, but you'd be free to set the direction for the project to your liking (e.g. changing the tokenizer)

@skeskinen Thanks for your suggestions.

After thinking, I wish to build a new repo (actually still a fork) with name embedding.cpp.

of course, orginal info for bert.cpp is kept.

The major reason is what I want is just an efficient text embedding tool which can be deployed standalone.

I've worked on this area for some time, and very glad to see bert.cpp and join in.

Some other minor reasons might be

  • I am eager for efficiency, but tokenizers-cpp will import rust into build, someone may not like the way.
  • Changes are large and happens frequently, holding in a fork might be better.

@FFengIll
Copy link
Author

@cgisky1980 both of them work well.

@cgisky1980
Copy link

This is a very exiting direction, and huge props to FFenglll for getting this working. Unfortunately I don't have much time to work on this project right now, but I think people would enjoy the changes you've made recently.
The usecase I originally had for this project is no longer valid, so I'm not as invested in making this library "production quality".
So I have 2 suggestions on how to share your changes:

  1. Make a new repo with the updates and I'll add a link to it at the beginning of the readme, saying that version is more up to date
  2. I can give you contributor rights to this repo and the huggingface storage for the pre-converted models. I can still do code reviews and answer any questions, but you'd be free to set the direction for the project to your liking (e.g. changing the tokenizer)

@skeskinen Thanks for your suggestions.

After thinking, I wish to build a new repo (actually still a fork) with name embedding.cpp.

of course, orginal info for bert.cpp is kept.

The major reason is what I want is just an efficient text embedding tool which can be deployed standalone.

I've worked on this area for some time, and very glad to see bert.cpp and join in.

Some other minor reasons might be

  • I am eager for efficiency, but tokenizers-cpp will import rust into build, someone may not like the way.
  • Changes are large and happens frequently, holding in a fork might be better.

THX. where is the new repo?

@cgisky1980
Copy link

@FFengIll need embedding.cpp

@FFengIll
Copy link
Author

Here is the repo: https://github.com/FFengIll/embedding.cpp

And I must remind that it is WIP and not stable enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants