-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unicode codepoint/grapheme boundaries #34
Comments
Unicode should works. Can you provide the code of parsing (I'll check configuration) and usage (I'll check used functions)? |
My code was something like this: func (p *TokenizerProvider) Init() error {
p.instance = tokenizer.New()
p.instance.DefineTokens(1, []string{",", ".", "!", "?", ";", ":"})
p.instance.DefineTokens(2, []string{"(", ")", "[", "]", "{", "}"})
p.instance.DefineTokens(3, []string{"-", "–", "—"})
p.instance.AllowKeywordSymbols(tokenizer.Underscore, tokenizer.Numbers)
p.instance.DefineStringToken(4, "\"", "\"").
SetEscapeSymbol(tokenizer.BackSlash).
AddSpecialStrings(tokenizer.DefaultSpecialString)
return nil
} func (p *TokenizerProvider) process(chunks []string) (common.AnyTokenSliceWrapper, error) {
tsw := &common.TknSliceWrapper{}
// chunks is []string{testString}
for _, chunk := range chunks {
stream := p.instance.ParseString(chunk)
defer stream.Close()
for stream.IsValid() {
token := stream.CurrentToken()
// the []string provided in the issue comes from the Surfaces of all the tokens of tsw
commonToken := common.Tkn{
Surface: token.ValueString(),
}
tsw.Append(commonToken)
stream.GoNext()
}
}
return tsw, nil
} |
Yep, it's a bug with unicode detection. |
|
Ok, after some investigation I found out that current bug is not a bug. I mean, that for golang's unicode any DevanagariSigns := []rune{
// Candrabindu to Visarga
'\u0900', // Devanagari Sign Inverted Candrabindu
'\u0901', // Devanagari Sign Candrabindu
'\u0902', // Devanagari Sign Anusvara
'\u0903', // Devanagari Sign Visarga
// Stress Signs
'\u0951', // Devanagari Stress Sign Udatta
'\u0952', // Devanagari Stress Sign Anudatta
'\u0953', // Devanagari Grave Accent
'\u0954', // Devanagari Acute Accent
// Various Signs
'\u093C', // Devanagari Sign Nukta
'\u094D', // Devanagari Sign Virama
// Vedic Signs
'\u1CD0', // Vedic Tone Karshana
'\u1CD1', // Vedic Tone Shara
'\u1CD2', // Vedic Tone Prenkha
'\u1CDA', // Vedic Tone Double Svarita
'\u1CDB', // Vedic Tone Triple Svarita
'\u1CDC', // Vedic Tone Kathaka Anudatta
'\u1CDD', // Vedic Tone Dot Below
'\u1CDE', // Vedic Tone Two Dots Below
'\u1CDF', // Vedic Tone Three Dots Below
// Length Mark
'\u0955', // Devanagari Vowel Sign Candra Long E
'\u0956', // Devanagari Vowel Sign UE
'\u0957', // Devanagari Vowel Sign UUE
// Sign Spacing Candrabindu
'\uA8F3', // Devanagari Sign Candrabindu Virama
'\uA8F4', // Devanagari Sign Double Candrabindu Virama
'\uA8F5', // Devanagari Sign Candrabindu Two
'\uA8F6', // Devanagari Sign Candrabindu Three
'\uA8F7', // Devanagari Sign Candrabindu Avagraha
// Om
'\u0950', // Devanagari Om
// Vowel Signs
'\u093E', // Devanagari Vowel Sign AA
'\u093F', // Devanagari Vowel Sign I
'\u0940', // Devanagari Vowel Sign II
'\u0941', // Devanagari Vowel Sign U
'\u0942', // Devanagari Vowel Sign UU
'\u0943', // Devanagari Vowel Sign Vocalic R
'\u0944', // Devanagari Vowel Sign Vocalic RR
'\u0945', // Devanagari Vowel Sign Candra E
'\u0946', // Devanagari Vowel Sign Short E
'\u0947', // Devanagari Vowel Sign E
'\u0948', // Devanagari Vowel Sign AI
'\u0949', // Devanagari Vowel Sign Candra O
'\u094A', // Devanagari Vowel Sign Short O
'\u094B', // Devanagari Vowel Sign O
'\u094C', // Devanagari Vowel Sign AU
'\u094E', // Devanagari Vowel Sign Prishthamatra E
'\u094F', // Devanagari Vowel Sign AW
}
p.instance.AllowKeywordSymbols(append(Underscore, DevanagariSigns...), append(Numbers, DevanagariSigns...))
// now everything works fine (I am not sure that Devanagari Sign can be first rune in a word) |
Interesting! I personally have end up using uniseg's FirstWordInString because I need to support arbitrary language scripts and it seems to handle things at grapheme level and but it would be useful to mention this caveat on the README for future users. |
I think I will add maker(sign) support as a settings via |
I have used this tokenizer with this test string:
"जब मैंने सुबह उठकर खिड़की खोली तो मैंने देखा कि बाहर एक सुंदर सा पक्षी गुनगुनाते हुए पेड़ की डाल पर बैठा था, जिसका रंग सुनहरा था और जिसकी आँखें चमकदार नीली थीं, वह इतना खूबसूरत था कि मैं उसे देखकर मंत्रमुग्ध हो गया।"
And got:
In indexes where diacritics are expected there seems to be several separated one-byte long sequence.
So, if I understand correctly this string tokenizer does not try to honor Unicode codepoint/grapheme boundaries in the byte sequence and therefore it is primarily meant for escaped, ASCII-like content right?
It would be nice to have it mentioned in the description that these unicode boundaries are not considered. (github.com/rivo/uniseg seems to be the tool for that).
Thanks
The text was updated successfully, but these errors were encountered: