Support for tokenization of languages without spaces #4
Labels: 🥳 enhancement (New feature or request) · 👋🏼 good first issue (Great for new contributors) · 🙋🏼‍♂️ help wanted (Extra attention is appreciated)
Need to implement a smarter method of tokenization that takes into account languages that traditionally do not use spaces between words. Currently, such input produces full-sentence tokens, which are not suitable for the current method of cosine-similarity comparisons.
Some of these languages include Chinese, Japanese, and Thai.
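One possible starting point is dictionary-based maximum matching: greedily take the longest known word at each position, falling back to single characters for out-of-vocabulary spans. This is a minimal sketch, not the project's actual tokenizer; the `max_match` function and the sample vocabulary are illustrative assumptions (production systems typically use a library such as jieba for Chinese or MeCab for Japanese).

```python
def max_match(text, vocab):
    """Greedy longest-match (maximum matching) segmentation.

    Splits a spaceless string into word tokens using a known
    vocabulary; characters not covered by any vocabulary entry
    fall back to single-character tokens.
    """
    tokens = []
    i, n = 0, len(text)
    while i < n:
        # Try the longest vocabulary word starting at position i.
        for j in range(n, i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No match: emit a single character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical vocabulary for demonstration only.
vocab = {"我", "喜欢", "机器", "学习", "机器学习"}
print(max_match("我喜欢机器学习", vocab))  # → ['我', '喜欢', '机器学习']
```

With word-level tokens like these, the existing cosine-similarity comparison can operate on the same granularity as it does for space-delimited languages.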