Implement a simple search feature #8
I think this is a great feature that is really needed.
Some comments regarding Unicode. Firstly, we must make sure that
irrespective of how the user enters the text, it is decomposed so that
searching works properly. The problem lies in that diacritical marks
(mostly for the Latin and Greek alphabets) can be entered in one of two
ways: either as a precomposed character *ä* or as a decomposed character
*ä* (that is as *a + ◌̈*). Although visually both look identical, the
underlying representation is different. According to the Unicode Standard,
the two forms are canonically equivalent and should be treated identically.
However, we need to check that this has actually been implemented. I am
afraid that Java may not handle it correctly.
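One way to check this in Java is sketched below. It assumes the standard `java.text.Normalizer` class; the class name `UnicodeForms` and the method `sameText` are made up for illustration and are not part of the project.

```java
import java.text.Normalizer;

// Hypothetical helper, not taken from the project's code base.
final class UnicodeForms {
    // Compare two strings after bringing both to the decomposed form (NFD).
    static boolean sameText(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E4"; // a with diaeresis as a single code point
        String decomposed = "a\u0308"; // a followed by a combining diaeresis
        System.out.println(precomposed.equals(decomposed));    // raw comparison: false
        System.out.println(sameText(precomposed, decomposed)); // after NFD: true
    }
}
```

A plain `String.equals` does not treat the two forms as equal, so the search code has to normalise both the search term and the searched text explicitly.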
As well, regarding Church Slavonic searching, I think that it would be
mandatory to have two options: strict and relaxed. In strict, the search
engine searches for the exact spelling of the word. In relaxed, the search
engine searches using a normalised form of the word (for example,
diacritical marks are stripped and {и, і}, {е, є}, {о, ѻ, Ѡ}, {ꙗ, ѧ} (as
examples) are treated within each set as equivalent). As well, superscript
letters would need to be handled somehow. Finally, abbreviations could be
expanded (I have a list of all (modern) Church Slavonic abbreviations,
which would cover us for all cases). The same could also apply to Greek
with respect to stripping the diacritical marks. This is especially
important since not everyone will necessarily be familiar with exactly how
to spell a word in Church Slavonic and the spelling of the word can change
during word formation, *e.g.* ѻ҆те́цъ (nominative singular), ѻ҆тє́цъ
(genitive plural), and then пра́ѻтецъ, which all should be found if we
search for “ѻтецъ”. Normalising the forms would give *отецъ*, *отецъ*, and
*праотецъ* which will now be easily found.
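A rough sketch of what the relaxed mode might look like in Java follows. The mapping table holds only the example sets from this comment (и/і, е/є, о/ѻ/ѡ, ꙗ/ѧ); the full list is not public, and the choice of which letter represents each set is an assumption. The names `RelaxedSlavonic`, `normalise` and `relaxedContains` are hypothetical. Superscript letters and abbreviations are not handled.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a "relaxed" comparison; not the project's code.
final class RelaxedSlavonic {
    // Illustrative equivalences only, keyed by lowercase letter.
    private static final Map<Character, Character> EQUIVALENTS = Map.of(
            '\u0456', '\u0438',  // U+0456 (Cyrillic i) -> U+0438 (Cyrillic i with two stems)
            '\u0454', '\u0435',  // U+0454 (wide ie) -> U+0435 (ie)
            '\u047B', '\u043E',  // U+047B (round omega) -> U+043E (o)
            '\u0461', '\u043E',  // U+0461 (omega) -> U+043E (o)
            '\u0467', '\uA657'); // U+0467 (little yus) -> U+A657 (iotified a)

    static String normalise(String text) {
        // Decompose, drop all combining marks, then fold case.
        // Caveat: NFD also splits short i (U+0439) into U+0438 plus a breve,
        // so the strip step folds it into U+0438.
        String stripped = Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "")
                .toLowerCase(Locale.ROOT);
        StringBuilder out = new StringBuilder(stripped.length());
        for (int i = 0; i < stripped.length(); i++) {
            char c = stripped.charAt(i);
            char mapped = EQUIVALENTS.getOrDefault(c, c);
            out.append(mapped);
        }
        return out.toString();
    }

    static boolean relaxedContains(String text, String term) {
        return normalise(text).contains(normalise(term));
    }

    public static void main(String[] args) {
        String term = "\u047B\u0442\u0435\u0446\u044A"; // the search term from the example
        String nominative = "\u047B\u0486\u0442\u0435\u0301\u0446\u044A";
        String genitivePlural = "\u047B\u0486\u0442\u0454\u0301\u0446\u044A";
        String compound = "\u043F\u0440\u0430\u0301\u047B\u0442\u0435\u0446\u044A";
        System.out.println(relaxedContains(nominative, term));
        System.out.println(relaxedContains(genitivePlural, term));
        System.out.println(relaxedContains(compound, term));
    }
}
```

Strict mode would skip `normalise` and compare the strings as entered, apart from the composed/decomposed issue described above.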
…On Thu, 24 Dec 2020 at 13:54, Tom L. ***@***.***> wrote:
I'm currently working on this.
[image: image]
<https://user-images.githubusercontent.com/10900989/103088705-e36af200-45eb-11eb-810a-5164c3776410.png>
I've changed the search results from text to tabular data. Clicking on a
row opens the corresponding commemoration in a new window.
Are there any specific features that should be added?
I tested with French, and it seems to handle both versions of é fine. I've currently implemented a checkbox that strips the accents from both the search term and the saint name. So far I've been testing in French, since that's a language I actually know. With the checkbox unchecked, the search term "Melece" doesn't give "St. Mélèce" as a result; with the checkbox checked, it does. I've also added a similar checkbox to ignore capitalization.

The library I'm using (java.text.Normalizer) can probably normalize the Church Slavonic to some degree, but I'll probably have to find a way to handle the abbreviations (hardcoding per your list, I guess) and the spelling differences related to word formations.

I'm fairly sure the normalization I've implemented so far can handle diacritical marks in Greek, though I'll have to find some examples to be certain.
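For reference, the two checkboxes could map onto something like the sketch below. This is not the code from the project: `NameSearch`, `prepare` and `matches` are hypothetical names, and the approach (NFD, then dropping combining marks with the `\p{M}` regex class) is one common way to do it with `java.text.Normalizer`.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical sketch of the two search options; not the project's code.
final class NameSearch {
    static String prepare(String s, boolean stripAccents, boolean ignoreCase) {
        // Always decompose, so composed and decomposed input compare equal.
        String out = Normalizer.normalize(s, Normalizer.Form.NFD);
        if (stripAccents) {
            out = out.replaceAll("\\p{M}+", ""); // drop combining marks
        }
        if (ignoreCase) {
            out = out.toLowerCase(Locale.ROOT);
        }
        return out;
    }

    static boolean matches(String name, String term,
                           boolean stripAccents, boolean ignoreCase) {
        return prepare(name, stripAccents, ignoreCase)
                .contains(prepare(term, stripAccents, ignoreCase));
    }

    public static void main(String[] args) {
        String name = "St. M\u00E9l\u00E8ce";
        System.out.println(matches(name, "Melece", false, false)); // false
        System.out.println(matches(name, "Melece", true, false));  // true
        System.out.println(matches(name, "melece", true, true));   // true
    }
}
```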
For polytonic Greek, I can suggest the form ἅγιος (masculine form of
*holy*). With diacritical marks stripped, it should also match the
monotonic Greek form άγιος (and vice versa). If you need any help with the
Church Slavonic, let me know and I can send you the required files.
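The Greek case can be turned into a small check like the one below, assuming the same strip-after-NFD approach via `java.text.Normalizer`; `GreekStrip` and `stripMarks` are hypothetical names.

```java
import java.text.Normalizer;

// Hypothetical check for the Greek example; not the project's code.
final class GreekStrip {
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Polytonic form: alpha with rough breathing and acute (U+1F05).
        String polytonic = "\u1F05\u03B3\u03B9\u03BF\u03C2";
        // Monotonic form: alpha with tonos (U+03AC).
        String monotonic = "\u03AC\u03B3\u03B9\u03BF\u03C2";
        System.out.println(polytonic.equals(monotonic));                         // false
        System.out.println(stripMarks(polytonic).equals(stripMarks(monotonic))); // true
    }
}
```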
As well, there is the question of Chinese normalisation regarding the two
forms of Chinese: simplified and traditional. Can Java handle this or not?
If it can, then we should enable it; otherwise there is little point in
implementing it. An example to try: traditional: 格奧爾吉; simplified: 格奥尔吉 (both
forms correspond to George in Chinese). Only the middle two characters are
different.
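On the Chinese question: `java.text.Normalizer` only implements the Unicode normalisation forms (NFC, NFD, NFKC, NFKD), and none of them maps traditional characters to simplified ones, so the two spellings stay different after normalising. Folding them together needs a conversion table. The sketch below uses a hand-written two-entry table that covers only this example; the names `ChineseFold` and `toSimplified` are hypothetical, and a real table (for instance from a library such as ICU, which would need checking) would be far larger.

```java
import java.text.Normalizer;
import java.util.Map;

// Hypothetical sketch; the table covers only the example from this comment.
final class ChineseFold {
    private static final Map<Character, Character> TRAD_TO_SIMP = Map.of(
            '\u5967', '\u5965',  // second character of the example name
            '\u723E', '\u5C14'); // third character of the example name

    static String toSimplified(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            char mapped = TRAD_TO_SIMP.getOrDefault(c, c);
            out.append(mapped);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String traditional = "\u683C\u5967\u723E\u5409";
        String simplified = "\u683C\u5965\u5C14\u5409";
        // Unicode normalisation alone does not make the two forms equal.
        System.out.println(Normalizer.normalize(traditional, Normalizer.Form.NFKC)
                .equals(Normalizer.normalize(simplified, Normalizer.Form.NFKC))); // false
        // Folding through the table does.
        System.out.println(toSimplified(traditional).equals(simplified));         // true
    }
}
```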
Another question: do you only search the name of the commemoration or do
you search any text in the corresponding html file?
…On Mon, 28 Dec 2020 at 21:42, Tom L. ***@***.***> wrote:
I tested with French, and it seems to handle both versions of é fine.
I've currently implemented a checkbox that strips the accents from both
the search term and the saint name. So far I've been testing in French,
since that's a language I actually know. With the checkbox unchecked, the
search term "Melece" doesn't give "St. Mélèce" as a result, with the
checkbox it does. I've also added a similar checkbox to ignore
capitalization.
The library I'm using (java.text.Normalizer) can probably normalize the
Church Slavonic to some degree, but I'll probably have to find a way to
handle the abbreviations (hardcoding per your list, I guess) and the
spelling differences related to word formations.
I'm fairly sure the normalization I've implemented so far can handle
diacritical marks in Greek, though I'll have to find some examples to be
certain.
[image: image]
<https://user-images.githubusercontent.com/10900989/103242123-8194ea00-4955-11eb-839f-058e55da2c83.png>
To easily test the cases you give me, I think I'm gonna extract some of the methods I've written to a utility class and write tests for them. I'll probably try to write tests for some of the existing classes as well later on.

I could use help with the Church Slavonic as well, since I can't even read Cyrillic (I interpreted the і in your equivalent sets as the Latin i at first, and was looking into romanization. I know better now.). Do you know a good source for all the equivalent sets?

I'll implement normalization under the "strip diacritical marks" checkbox in languages that require it, and then the translation strings can be different to indicate it.

Currently I'm only searching for the name, but I can easily add a checkbox to search the getLife() as well.
I can send you the information about equivalent sets and also all the
abbreviations in Church Slavonic. Would you mind if I e-mailed the files
directly to you? I do not wish them to be made public just yet. Would the
e-mail address from your website work?
I think searching on the life as an option could be useful, especially if
we are trying to weed out any errors that may be found in the texts.
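The name-versus-life option could be as small as one extra flag on the search routine. The sketch below is only an illustration: `ScopeSearch` and `Entry` are hypothetical stand-ins (only the `getLife()` name comes from this thread), the sample data is invented, and the comparison is a plain case-insensitive substring match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of a search-scope option; not the project's code.
final class ScopeSearch {
    // Stand-in for the project's commemoration type.
    static final class Entry {
        private final String name;
        private final String life;

        Entry(String name, String life) {
            this.name = name;
            this.life = life;
        }

        String getName() { return name; }
        String getLife() { return life; }
    }

    // Return entries whose name (and, optionally, life text) contains the term.
    static List<Entry> search(List<Entry> entries, String term, boolean includeLife) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<Entry> hits = new ArrayList<>();
        for (Entry e : entries) {
            boolean inName = e.getName().toLowerCase(Locale.ROOT).contains(needle);
            boolean inLife = includeLife
                    && e.getLife().toLowerCase(Locale.ROOT).contains(needle);
            if (inName || inLife) {
                hits.add(e);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Invented sample data, for illustration only.
        List<Entry> data = List.of(
                new Entry("St. George", "A martyr from Cappadocia."),
                new Entry("St. Basil", "Bishop of Caesarea in Cappadocia."));
        System.out.println(search(data, "cappadocia", false).size()); // 0
        System.out.println(search(data, "cappadocia", true).size());  // 2
    }
}
```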
…On Tue, 29 Dec 2020 at 11:20, Tom L. ***@***.***> wrote:
To easily test the cases you give me, I think I'm gonna extract some of
the methods I've written to a utility class and write tests for them. I'll
probably try to write tests for some of the existing classes as well later
on.
I could use help with the Church Slavonic as well, since I can't even read
Cyrillic (I interpreted the і in your equivalent sets as the Latin i at
first, and was looking into romanization. I know better now.). Do you know
a good source for all the equivalent sets?
I'll implement normalization under the "strip diacritical marks" checkbox
in languages that require it, and then the translation strings can be
different to indicate it.
Currently I'm only searching for the name, but I can easily add a checkbox
to search the getLife() as well.
Original issue reported on code.google.com by
[email protected]
on 6 Feb 2015 at 5:05