Support for tokenization of languages without spaces #4
Labels: 🥳 enhancement (New feature or request) · 👋🏼 good first issue (Great for new contributors) · 🙋🏼‍♂️ help wanted (Extra attention is appreciated)
Need to implement a smarter method of tokenization that takes into account languages that traditionally do not use spaces between words. Currently, such input produces full-sentence tokens, which are not suitable for the current method of cosine-similarity comparisons.
Some of these languages include Chinese, Japanese, and Thai.
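One possible starting point is dictionary-based maximum matching: greedily take the longest known word at each position, falling back to single characters for out-of-vocabulary spans. This is a minimal sketch, not the project's actual tokenizer; the `max_match` function and the sample vocabulary are illustrative assumptions (production systems typically use a library such as jieba for Chinese or MeCab for Japanese).

```python
def max_match(text, vocab):
    """Greedy longest-match (maximum matching) segmentation.

    Splits a spaceless string into word tokens using a known
    vocabulary; characters not covered by any vocabulary entry
    fall back to single-character tokens.
    """
    tokens = []
    i, n = 0, len(text)
    while i < n:
        # Try the longest vocabulary word starting at position i.
        for j in range(n, i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No match: emit a single character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical vocabulary for demonstration only.
vocab = {"我", "喜欢", "机器", "学习", "机器学习"}
print(max_match("我喜欢机器学习", vocab))  # → ['我', '喜欢', '机器学习']
```

With word-level tokens like these, the existing cosine-similarity comparison can operate on the same granularity as it does for space-delimited languages.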