Apostrophes #187
A full list of French elisions seems to be:
ojwb added a commit to snowballstem/snowball-data that referenced this issue on Jan 30, 2025: Include ASCII apostrophe in standard "letter" regexps. See snowballstem/snowball#187
ojwb added a commit to snowballstem/snowball-data that referenced this issue on Jan 30, 2025
ojwb added a commit that referenced this issue on Jan 30, 2025
ojwb added a commit to snowballstem/snowball-website that referenced this issue on Jan 31, 2025
I have adjusted the French stemmer to remove elisions. Still to do:
We seem to have a lack of consistency in how we expect apostrophes to be handled by code using Snowball stemmers, which means currently tokenisation before stemming needs to encode knowledge of the stemmer to be used. This is explicitly noted in the docs, but that doesn't make it any less unhelpful:
Rather contradicting the text quoted above, the English stemmer expects apostrophes to be treated as a letter: "the English stemmer treats apostrophe as a letter" (https://snowballstem.org/texts/apostrophe.html).
Catalan includes suffixes containing apostrophe.
Irish includes prefixes containing apostrophe.
As above, the Kraaij Pohlmann stemmer also expects apostrophes to be treated as a word character.
But the "Dutch" stemmer doesn't.
The French stemmer doesn't either (and so doesn't expect `l'` and `d'` prefixes to be present on input).

If it's feasible then I think it'd be more helpful for all the stemmers to handle apostrophe being treated as a word character. If there's a reason why we can't, then we should provide some sort of metadata (e.g. an "apostrophe_is_word_character" flag that can be queried on each stemmer) so that code using the stemmers can automatically configure their tokenisation stage.
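
To make the metadata idea concrete, here is a rough Python sketch of how calling code might pick its tokenisation from such a flag. Everything in it is an assumption for illustration: Snowball does not currently expose this flag, the `APOSTROPHE_IS_WORD_CHARACTER` table and `make_tokeniser` helper are hypothetical names, and the table values simply restate the per-stemmer behaviour described above at the time of writing.

```python
import re

# Hypothetical metadata table: NOT part of any Snowball API. The values
# restate the behaviour described in this issue for each stemmer.
APOSTROPHE_IS_WORD_CHARACTER = {
    "english": True,
    "catalan": True,
    "irish": True,
    "kraaij_pohlmann": True,
    "dutch": False,
    "french": False,
}

# Unicode letters only: word characters minus digits and underscore.
LETTERS_ONLY = re.compile(r"[^\W\d_]+")
# The same letters, plus the ASCII apostrophe treated as a letter.
LETTERS_AND_APOSTROPHE = re.compile(r"(?:[^\W\d_]|')+")


def make_tokeniser(stemmer_name):
    """Return a tokenising function configured for the named stemmer."""
    if APOSTROPHE_IS_WORD_CHARACTER[stemmer_name]:
        pattern = LETTERS_AND_APOSTROPHE
    else:
        pattern = LETTERS_ONLY
    return pattern.findall


print(make_tokeniser("english")("the dog's bowl"))  # ['the', "dog's", 'bowl']
print(make_tokeniser("french")("l'homme d'eau"))    # ['l', 'homme', 'd', 'eau']
```

With a flag like this, the tokeniser no longer has to hard-code which stemmer it feeds; it only asks the stemmer how apostrophes should be treated. The sketch handles the ASCII apostrophe only and says nothing about how other apostrophe-like characters should be normalised.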