Apostrophes #187

ojwb · 2023-12-07T04:45:25Z

We seem to have a lack of consistency in how we expect apostrophes to be handled by code using Snowball stemmers, which means currently tokenisation before stemming needs to encode knowledge of the stemmer to be used. This is explicitly noted in the docs, but that doesn't make it any less unhelpful:

What is a word? For indexing purposes, a word in a European language is a sequence of letters bounded by non-letters. But in English, an internal apostrophe does not split a word, although it is not classed as a letter. The treatment of these word boundary characters affects the stemmer. For example, the Kraaij Pohlmann stemmer for Dutch (Kraaij, 1994, 1995) removes hyphen and treats apostrophe as part of the alphabet (so 's, 'tje and 'je are three of their endings). The Dutch stemmer presented here assumes hyphen and apostrophe have already been removed from the word to be stemmed.

Rather contradicting the text quoted above, the English stemmer expects apostrophes to be treated as a letter: "the English stemmer treats apostrophe as a letter" (https://snowballstem.org/texts/apostrophe.html).
Catalan includes suffixes containing apostrophe.
Irish includes prefixes containing apostrophe.
As above, the Kraaij Pohlmann stemmer also expects apostrophes to be treated as a word character.
But the "Dutch" stemmer doesn't.
The French stemmer doesn't either (and so doesn't expect l' and d' prefixes to be present on input).

If it's feasible then I think it'd be more helpful for all the stemmers to handle apostrophe being treated as a word character. If there's a reason why we can't, then we should provide some sort of metadata (e.g. an "apostrophe_is_word_character" flag that can be queried on each stemmer) so that code using the stemmers can automatically configure their tokenisation stage.
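
To make the metadata idea concrete, here is a minimal sketch of how calling code might pick its tokenisation from such a flag. Everything in it is hypothetical: Snowball has no such query today, the `APOSTROPHE_IS_WORD_CHARACTER` table and `tokenise` function are invented for illustration, and the values simply mirror the list above as it stood when this was written. Splitting on the apostrophe is only one reading of "apostrophe already removed".

```python
import re

# Hypothetical per-stemmer metadata; not an existing Snowball API.
# Values mirror the behaviour listed in this issue at the time of writing.
APOSTROPHE_IS_WORD_CHARACTER = {
    "english": True,
    "catalan": True,
    "irish": True,
    "kraaij_pohlmann": True,
    "dutch": False,
    "french": False,
}

# [^\W\d_] matches any Unicode letter (word characters minus digits and "_").
LETTERS_ONLY = re.compile(r"[^\W\d_]+")
LETTERS_WITH_APOSTROPHE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def tokenise(text, language):
    """Split text into words, keeping internal ASCII apostrophes only
    when the (hypothetical) flag says the stemmer expects them."""
    if APOSTROPHE_IS_WORD_CHARACTER[language]:
        pattern = LETTERS_WITH_APOSTROPHE
    else:
        pattern = LETTERS_ONLY
    return pattern.findall(text)
```

With a flag like this, the caller would not need to hard-code knowledge of each stemmer: `tokenise("the dog's bone", "english")` keeps `dog's` whole, while `tokenise("l'homme", "french")` yields `l` and `homme` as separate tokens.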

ojwb · 2023-12-07T04:57:02Z

A full list of French elisions seems to be: l', j', c', m', t', s', n', d', qu'
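
As a rough illustration of what stripping these prefixes could look like in a pre-processing step, here is a sketch in Python. It is not the Snowball implementation (the stemmers are written in the Snowball language), `strip_elision` is a made-up helper, and it only handles the ASCII apostrophe.

```python
# Elided prefixes from the list above; "qu'" is listed first only for
# readability, since no other prefix here starts with "q".
ELISIONS = ("qu'", "l'", "j'", "c'", "m'", "t'", "s'", "n'", "d'")


def strip_elision(word):
    """Remove one leading elided prefix, if present and not the whole word."""
    for prefix in ELISIONS:
        head = word[:len(prefix)].lower()
        if head == prefix and len(word) > len(prefix):
            return word[len(prefix):]
    return word
```

Only a leading prefix is removed, so a word such as `aujourd'hui` with an internal apostrophe is left alone.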

ojwb · 2025-01-31T00:53:10Z

I have adjusted the French stemmer to remove elisions.

Still to do:

Handle Unicode apostrophe too anywhere that handles ASCII apostrophe. The wrinkle here is that U+2019 (and any other similar characters we might want to also treat as an apostrophe) is not present in iso-8859-1. We ideally want to avoid having charset-specific variants of an algorithm (we used to and they got out of step in some cases). Probably the answer is to treat characters that can't be encoded in the specified character set as characters we won't see, at least in some cases - perhaps it needs a way to mark characters as "optional" so that e.g. trying to generate an iso-8859-1 version of the Arabic stemmer doesn't succeed with a useless result.
Do something about Dutch? Resolving "Mistakes in the Dutch stemmer" (#1) by switching to Kraaij Pohlmann (or to a merged Dutch stemmer taking the best of both) would resolve this, and it would be good to finally settle that, but maybe it's too much of a can of worms.
Update https://snowballstem.org/texts/introduction.html
Update https://snowballstem.org/texts/apostrophe.html
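
The charset wrinkle in the first item can be sketched outside the Snowball compiler. The Python below is only an illustration of the "optional character" idea under my own assumptions: `encodable` and `usable_characters` are hypothetical helpers, not anything Snowball provides, and U+2019 is the only apostrophe-like character considered.

```python
def encodable(ch, charset):
    """Return True if the character can be represented in the charset."""
    try:
        ch.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def usable_characters(required, optional, charset):
    """Hypothetical rule: required characters must all be encodable,
    optional ones are silently dropped when the charset lacks them."""
    missing = [c for c in required if not encodable(c, charset)]
    if missing:
        raise ValueError(f"charset {charset} cannot encode {missing!r}")
    return set(required) | {c for c in optional if encodable(c, charset)}


def normalise_apostrophes(text):
    """Caller-side alternative: fold U+2019 to ASCII apostrophe first."""
    return text.replace("\u2019", "'")
```

Under this rule, an iso-8859-1 build would keep the ASCII apostrophe and drop U+2019 as a character it will never see, while asking for an iso-8859-1 build that requires an Arabic letter would fail rather than produce a useless result.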

ojwb mentioned this issue Dec 8, 2023

Turkish proper noun suffixes #188

Closed

ojwb added a commit to snowballstem/snowball-data that referenced this issue Jan 30, 2025

wikipedia-dump-to-freq: Include ASCII apostrophe

5c427da

Include ASCII apostrophe in standard "letter" regexps. See snowballstem/snowball#187

ojwb added a commit to snowballstem/snowball-data that referenced this issue Jan 30, 2025

french: Add ASCII apostrophe as word character

c437f9f

See snowballstem/snowball#187

ojwb added a commit that referenced this issue Jan 30, 2025

french: Remove elisions as first step

664b989

See #187

ojwb added a commit to snowballstem/snowball-website that referenced this issue Jan 31, 2025

french: Update to include removal of elisions

1701105

See snowballstem/snowball#187
