Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode codepoint/grapheme boundaries #34

Open
tassa-yoniso-manasi-karoto opened this issue Jan 19, 2025 · 7 comments
Open

Unicode codepoint/grapheme boundaries #34

tassa-yoniso-manasi-karoto opened this issue Jan 19, 2025 · 7 comments
Assignees
Labels
enhancement New feature or request

Comments

@tassa-yoniso-manasi-karoto

I have used this tokenizer with this test string:
"जब मैंने सुबह उठकर खिड़की खोली तो मैंने देखा कि बाहर एक सुंदर सा पक्षी गुनगुनाते हुए पेड़ की डाल पर बैठा था, जिसका रंग सुनहरा था और जिसकी आँखें चमकदार नीली थीं, वह इतना खूबसूरत था कि मैं उसे देखकर मंत्रमुग्ध हो गया।"

And got:

[]string{
	"जब",
	"म",
	"\xe0",
	"\xa5",
	"\x88",
	"\xe0",
	"\xa4",
	"\x82",
	"न",
	"\xe0",
	"\xa5",
	"\x87",
	"स",
	// and so on
}

In indexes where diacritics are expected there seems to be several separated one-byte long sequence.

So, if I understand correctly this string tokenizer does not try to honor Unicode codepoint/grapheme boundaries in the byte sequence and therefore it is primarily meant for escaped, ASCII-like content right?

It would be nice to have it mentioned in the description that these unicode boundaries are not considered. (github.com/rivo/uniseg seems to be the tool for that).

Thanks

@bzick bzick self-assigned this Jan 19, 2025
@bzick
Copy link
Owner

bzick commented Jan 19, 2025

Unicode should works. Can you provide the code of parsing (I'll check configuration) and usage (I'll check used functions)?

@tassa-yoniso-manasi-karoto
Copy link
Author

tassa-yoniso-manasi-karoto commented Jan 19, 2025

My code was something like this:

func (p *TokenizerProvider) Init() error {
	p.instance = tokenizer.New()
	
	p.instance.DefineTokens(1, []string{",", ".", "!", "?", ";", ":"})
	p.instance.DefineTokens(2, []string{"(", ")", "[", "]", "{", "}"})
	p.instance.DefineTokens(3, []string{"-", "–", "—"})
	
	p.instance.AllowKeywordSymbols(tokenizer.Underscore, tokenizer.Numbers)
	
	p.instance.DefineStringToken(4, "\"", "\"").
		SetEscapeSymbol(tokenizer.BackSlash).
		AddSpecialStrings(tokenizer.DefaultSpecialString)
	
	return nil
}
func (p *TokenizerProvider) process(chunks []string) (common.AnyTokenSliceWrapper, error) {
	tsw := &common.TknSliceWrapper{}
	// chunks is []string{testString}
	for _, chunk := range chunks {
		stream := p.instance.ParseString(chunk)
		defer stream.Close()

		for stream.IsValid() {
			token := stream.CurrentToken()
			
			// the []string provided in the issue comes from the Surfaces of all the tokens of tsw
			commonToken := common.Tkn{
				Surface:    token.ValueString(),
			}

			tsw.Append(commonToken)
			stream.GoNext()
		}
	}

	return tsw, nil
}

@bzick
Copy link
Owner

bzick commented Jan 20, 2025

Yep, it's a bug with unicode detection. Bug found and will be fixed ASAP Continuing my investigations.

@bzick
Copy link
Owner

bzick commented Jan 20, 2025

Can I use the string जब मैंने सुबह उठकर खिड़की खोली तो मैंने देखा कि बाहर एक सुंदर सा पक्षी गुनगुनाते हुए पेड़ की डाल पर बैठा था, जिसका रंग सुनहरा था और जिसकी आँखें चमकदार नीली थीं, वह इतना खूबसूरत था कि मैं उसे देखकर मंत्रमुग्ध हो गया। in tests? Not needed

@bzick
Copy link
Owner

bzick commented Jan 20, 2025

Ok, after some investigation I found out that current bug is not a bug. I mean, that for golang's unicode any Devanagari Sign is not a letters (unicode.IsLetter(DevanagariSign) == false). By this reason any Devanagari Sign parsed as unknown token and broke keywords as you wrote. To solve the problem you have to add all Devanagari Signs to keyword symbols:

DevanagariSigns := []rune{
		// Candrabindu to Visarga
		'\u0900', // Devanagari Sign Inverted Candrabindu
		'\u0901', // Devanagari Sign Candrabindu
		'\u0902', // Devanagari Sign Anusvara
		'\u0903', // Devanagari Sign Visarga

		// Stress Signs
		'\u0951', // Devanagari Stress Sign Udatta
		'\u0952', // Devanagari Stress Sign Anudatta
		'\u0953', // Devanagari Grave Accent
		'\u0954', // Devanagari Acute Accent

		// Various Signs
		'\u093C', // Devanagari Sign Nukta
		'\u094D', // Devanagari Sign Virama

		// Vedic Signs
		'\u1CD0', // Vedic Tone Karshana
		'\u1CD1', // Vedic Tone Shara
		'\u1CD2', // Vedic Tone Prenkha
		'\u1CDA', // Vedic Tone Double Svarita
		'\u1CDB', // Vedic Tone Triple Svarita
		'\u1CDC', // Vedic Tone Kathaka Anudatta
		'\u1CDD', // Vedic Tone Dot Below
		'\u1CDE', // Vedic Tone Two Dots Below
		'\u1CDF', // Vedic Tone Three Dots Below

		// Length Mark
		'\u0955', // Devanagari Vowel Sign Candra Long E
		'\u0956', // Devanagari Vowel Sign UE
		'\u0957', // Devanagari Vowel Sign UUE

		// Sign Spacing Candrabindu
		'\uA8F3', // Devanagari Sign Candrabindu Virama
		'\uA8F4', // Devanagari Sign Double Candrabindu Virama
		'\uA8F5', // Devanagari Sign Candrabindu Two
		'\uA8F6', // Devanagari Sign Candrabindu Three
		'\uA8F7', // Devanagari Sign Candrabindu Avagraha

		// Om
		'\u0950', // Devanagari Om

		// Vowel Signs
		'\u093E', // Devanagari Vowel Sign AA
		'\u093F', // Devanagari Vowel Sign I
		'\u0940', // Devanagari Vowel Sign II
		'\u0941', // Devanagari Vowel Sign U
		'\u0942', // Devanagari Vowel Sign UU
		'\u0943', // Devanagari Vowel Sign Vocalic R
		'\u0944', // Devanagari Vowel Sign Vocalic RR
		'\u0945', // Devanagari Vowel Sign Candra E
		'\u0946', // Devanagari Vowel Sign Short E
		'\u0947', // Devanagari Vowel Sign E
		'\u0948', // Devanagari Vowel Sign AI
		'\u0949', // Devanagari Vowel Sign Candra O
		'\u094A', // Devanagari Vowel Sign Short O
		'\u094B', // Devanagari Vowel Sign O
		'\u094C', // Devanagari Vowel Sign AU
		'\u094E', // Devanagari Vowel Sign Prishthamatra E
		'\u094F', // Devanagari Vowel Sign AW
	}
	p.instance.AllowKeywordSymbols(append(Underscore, DevanagariSigns...), append(Numbers, DevanagariSigns...))

	// now everything works fine

(I am not sure that Devanagari Sign can be first rune in a word)

@tassa-yoniso-manasi-karoto
Copy link
Author

Interesting! I personally have end up using uniseg's FirstWordInString because I need to support arbitrary language scripts and it seems to handle things at grapheme level and but it would be useful to mention this caveat on the README for future users.

@bzick
Copy link
Owner

bzick commented Jan 21, 2025

I think I will add maker(sign) support as a settings via unicode.IsMarker()

@bzick bzick added the enhancement New feature or request label Jan 27, 2025
bzick added a commit that referenced this issue Feb 13, 2025
bzick added a commit that referenced this issue Feb 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants