simple search, v1.1 #26

Open
funderburkjim opened this issue Jan 25, 2021 · 24 comments
Labels
enhancement New feature or request

Comments

@funderburkjim
Contributor

funderburkjim commented Jan 25, 2021

A new version of simple search is currently available under a 'test' url:

https://sanskrit-lexicon.uni-koeln.de/simplet/

The previous version is also available, under https://www.sanskrit-lexicon.uni-koeln.de/simple/

I would like some users to experiment with the new version before making it available under
https://sanskrit-lexicon.uni-koeln.de/simple/

@funderburkjim
Contributor Author

The new version can be called with parameters DICT and KEY: https://www.sanskrit-lexicon.uni-koeln.de/simplet/DICT/KEY.

It also accepts optional additional parameters: /SIMPLE_INPUT/OUTPUT/ACCENT

The SIMPLE_INPUT parameter specifies the assumed spelling of KEY, and this value is visible in another menu.

When not specified in the URL, SIMPLE_INPUT defaults to 'default'. This assumes a phonetic type spelling, which
may include IAST-type diacritics. This spelling is also not case-sensitive (i.e., all letters are lower-cased before searching).
You can enter KEY in Devanagari with the 'default' SIMPLE_INPUT.
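
A minimal sketch of the 'default' key handling described above, assuming Devanagari input is detected by its Unicode block and passed through unchanged, while Roman input is lower-cased. The name `prepare_key` and the returned labels are hypothetical; the thread does not say how simple search implements this.

```python
def prepare_key(key):
    """Classify a search key and lower-case it if it is Roman text.

    Hypothetical helper: Devanagari is detected via the Unicode block
    U+0900..U+097F; anything else is treated as Roman and lower-cased.
    """
    if any("\u0900" <= ch <= "\u097f" for ch in key):
        return ("devanagari", key)
    return ("roman", key.lower())


print(prepare_key("KRISNA"))  # Roman input is lower-cased
print(prepare_key("कृष्ण"))    # Devanagari input is left as-is
```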

When SIMPLE_INPUT is one of the other values (slp1, hk, itrans), then the spelling of KEY is assumed to use the peculiarities
of the chosen transcoding.
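
To make the URL layout concrete, here is a rough sketch that splits a `/simplet/DICT/KEY/SIMPLE_INPUT/OUTPUT/ACCENT` path into named parameters. Only the 'default' value for SIMPLE_INPUT comes from the description above; the function name, the lower-case parameter names, and the `None` placeholders for OUTPUT and ACCENT are assumptions made for illustration.

```python
from urllib.parse import urlparse, unquote

# Positional parameter names, in the order given in the comment above.
PARAM_NAMES = ["dict", "key", "simple_input", "output", "accent"]


def parse_simple_url(url, prefix="/simplet/"):
    """Split a simple-search URL into a dict of parameters (hypothetical helper)."""
    path = urlparse(url).path
    if not path.startswith(prefix):
        raise ValueError(f"path does not start with {prefix!r}")
    parts = [unquote(p) for p in path[len(prefix):].split("/") if p]
    if len(parts) < 2:
        raise ValueError("DICT and KEY are required")
    params = dict(zip(PARAM_NAMES, parts))
    params.setdefault("simple_input", "default")  # default stated in the thread
    params.setdefault("output", None)             # default not stated; placeholder
    params.setdefault("accent", None)             # default not stated; placeholder
    return params


print(parse_simple_url("https://www.sanskrit-lexicon.uni-koeln.de/simplet/mw/guru"))
```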

In addition to the SIMPLE_INPUT parameter, some additional enhancements have been made to better model spelling
variations in SKD and other dictionaries. I expect some additional tweaks will be discovered that can be handled within
the current model of simple_search.

@funderburkjim
Contributor Author

Working on bug exemplified by 'rupa'. Problem is that 'ru' has variants, but then the variants of 'u' are lost!
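
One way to picture this bug: if a two-letter rule such as 'ru' consumes its input, the single-letter rule for 'u' never gets a chance. The sketch below tries every rule that matches at each position and also lets the first character pass through unchanged, so both sets of variants survive. The rule table is invented for illustration (written in IAST) and is not the one simple search uses.

```python
# Hypothetical rule table: source spelling -> alternative spellings.
RULES = {
    "ru": ["ṛ"],
    "u": ["u", "ū"],
    "a": ["a", "ā"],
}


def variants(key, rules=RULES):
    """Return every spelling reachable by applying rules at any position."""
    if not key:
        return {""}
    out = set()
    # Apply each rule whose source matches at the start of the key.
    for src, alts in rules.items():
        if key.startswith(src):
            for alt in alts:
                for tail in variants(key[len(src):], rules):
                    out.add(alt + tail)
    # Also keep the first character unchanged, so shorter rules
    # (like 'u') are still tried on the following characters.
    for tail in variants(key[1:], rules):
        out.add(key[0] + tail)
    return out


print(sorted(variants("rupa")))
```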

@funderburkjim
Contributor Author

Some problems from sanskrit-lexicon/COLOGNE#167

resolved

  • vrisapha gives varṣapa AND vṛṣabha
  • vacakah -> vācaka, vacaka
  • rupa -> rupa rūpa arbha rūpā (former version, only rupa returned in mw)
  • KRISNA -> kṛṣṇa kṛṣṇā (formerly no results)
    • KRISN -> kf kfzRa kfSa karza kfz kruS karSana kfS kfSana kfza (WEIRD -- not sure why all those others? )

STILL unresolved

  • yoginaḥ : This is an inflected form. The current model tweaks a few inflected forms (nom. singular: am, aH)
    More common inflected forms could probably be recognized.
  • gṛhastha is recognized, but not gṛhatsha
    • the difference is 'st' and 'ts'. This kind of spelling difference does not fit current search model. Currently hard to solve.
  • hariṇyagarbha instead of hiraṇyagarbha --- Right, current model doesn't know how to handle this
  • kūṭstha still no results - (whereas kūṭastha is found). --
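
For the 'st' versus 'ts' case, one possible approach (not something the current model does, as noted above) is to try every swap of two adjacent letters when the spelling as typed is not found. This is only a sketch with hypothetical function names; it would not help with hariṇyagarbha, where the swapped letters are not adjacent, or with kūṭstha, where a letter is missing.

```python
def adjacent_swaps(word):
    """Return all spellings obtained by swapping one adjacent pair of letters."""
    chars = list(word)
    out = set()
    for i in range(len(chars) - 1):
        if chars[i] != chars[i + 1]:
            swapped = chars[:]
            swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
            out.add("".join(swapped))
    return out


def lookup_with_swaps(word, headwords):
    """Return the word if it is a headword, else any headword one swap away."""
    if word in headwords:
        return [word]
    return sorted(adjacent_swaps(word) & headwords)


HEADWORDS = {"gṛhastha", "hiraṇyagarbha"}  # tiny illustrative headword set
print(lookup_with_swaps("gṛhatsha", HEADWORDS))
print(lookup_with_swaps("hariṇyagarbha", HEADWORDS))
```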

@funderburkjim
Contributor Author

@gasyoun (or others) Please point out where some 'low-hanging fruit' improvements to simple search might be.

@funderburkjim
Contributor Author

transitions should be different for slp1 than default

  • guru with INPUT_SIMPLE = default:
    • guru guRa guRin GuRa guRana gur guRI gurU gUr Gur GuR GUr guRi GuRi
  • guru with INPUT_SIMPLE = slp1
    • guru guRa guRin GuRa guRana gur guRI gurU gUr Gur GuR GUr guRi GuRi

They are the same.
But, it probably would be reasonable, for slp1, not to use transitions like 'r R', 'g G', etc. Although these are reasonable
for default.
Agree?
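
A small sketch of what per-spelling transition tables could look like, assuming the transitions are single-letter alternations such as 'r R' and 'g G'. The table contents and the function name are illustrative only; the real transition set in simple search is larger.

```python
from itertools import product

# Hypothetical per-input-mode transition tables.
# Each letter maps to the string of letters it may stand for.
TRANSITIONS = {
    "default": {"r": "rR", "g": "gG"},
    "slp1": {},  # slp1 input is taken literally: no case transitions
}


def expand(key, input_mode):
    """Return all spellings of key allowed by the transitions for input_mode."""
    table = TRANSITIONS.get(input_mode, {})
    choices = [table.get(ch, ch) for ch in key]
    return {"".join(combo) for combo in product(*choices)}


print(sorted(expand("guru", "default")))
print(sorted(expand("guru", "slp1")))
```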

@gasyoun gasyoun added the enhancement New feature or request label Jan 25, 2021
@gasyoun
Member

gasyoun commented Jan 25, 2021

krsnaa

If I had a grandpa still alive, I wish it would be you - Jim the magician. The four resolved ones work perfectly.

KRISN -> kf kfzRa kfSa karza kfz kruS karSana kfS kfSana kfza (WEIRD -- not sure why all those others? )

10 results: kṛ kṛṣṇa kṛśa karṣa kṛṣ kruś karśana kṛś kṛśana kṛṣa. Strange over-generation indeed, compared with just KRISNA.

More common inflected forms could probably be recognized.

Right, just a few more common ones.

the difference is 'st' and 'ts'. This kind of spelling difference does not fit current search model. Currently hard to solve.

Yes, and it's not critical. Good to have, because Sanskrit words live in strange ways, but not critical, as hard to solve.

'low-hanging fruit' improvements

  1. I would add a few inflected forms, to cover Nominative forms:
    yoginaḥ
  2. Koush for koṣa
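
The inflected-form suggestion in item 1 could be sketched as a small table of ending rules. The first two rules mirror the nominative singular endings (am, aH) mentioned earlier in the thread, though mapping them to a stem in -a is an assumption; the third rule is a hypothetical addition that would cover yoginaḥ.

```python
# (ending, replacement) pairs, written in IAST.
ENDING_RULES = [
    ("aḥ", "a"),  # e.g. devaḥ -> deva
    ("am", "a"),  # e.g. phalam -> phala
    ("aḥ", ""),   # hypothetical extra rule: yoginaḥ -> yogin
]


def stem_candidates(form):
    """Return the form itself plus any stems suggested by the ending rules."""
    candidates = [form]
    for ending, replacement in ENDING_RULES:
        if form.endswith(ending):
            stem = form[: -len(ending)] + replacement
            if stem not in candidates:
                candidates.append(stem)
    return candidates


print(stem_candidates("yoginaḥ"))
```

Each candidate would then be looked up as a headword, so extra candidates that match nothing are harmless.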

for slp1, not to use transitions like 'r R', 'g G', etc. Although these are reasonable for default.

Agree.

@gasyoun
Member

gasyoun commented Jan 27, 2021

watch

@Andhrabharati & @funderburkjim
Tamilised Sanskrit word checklist (https://youtu.be/BW4qa0lBZX4). If I enter the:

  1. tamilish version of Sanskrit anantha, I get ananta as planned (and cologne symlink to apidev #1).
  2. tamilish version of Sanskrit samudhra, I get samudra as planned (and cologne symlink to apidev #1).
  3. tamilish version of Sanskrit thara, I get tāra as planned (and cologne symlink to apidev #1).
  4. tamilish version of Sanskrit poojana, I do not get pūjanā as planned. Even pooja does not work for pūjā
  5. namaḥ has no possible verb listed (5 results: nāman nāma namana nama nāmana)
  6. tamilish version of Sanskrit varahsini, I do not get vārāśini as planned. So ah can equal to ā
  7. tamilish version of Sanskrit dourbhagya, I do not get daurbhāgya as planned. So ou can equal to au.
  8. tamilish version of Sanskrit dhoorikrutha, I do not get dūrīkṛta as planned. We have almost all the replacements, other than the oo and ee (= ī) I guess.
  9. tamilish version of Sanskrit vipathitha, I do not get vipattita as planned, but vipatita and vipāṭita. So t, but would want tt
  10. tamilish version of Sanskrit natha, I do not get natā as planned.
    Instead I get 6 results: nata nātha naṭa naṭana naṭā nāṭa where natá is the closest, but the list does not contain natā.
  11. tamilish version of Sanskrit kadachid, I do not get kadācid as planned, because it's not a base in any of the dictionaries.
    Compare:
    kaccid:BEN;2788,CAE;6847,CCS;4222,MD;5341,MW72;13322,MW;41751,PW;23308,PWG;14284,STC;7514
    kiMcid:MW;50367,PW;27806,PWG;70717,SCH;10618
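
The two index lines quoted in item 11 appear to follow a `headword:DICT;number,DICT;number,...` layout. A short sketch for reading such a line is below; the thread does not say what the numbers mean, so they are kept as plain integers, and the function name is hypothetical.

```python
def parse_index_line(line):
    """Split 'headword:DICT;number,...' into (headword, {DICT: number})."""
    headword, _, rest = line.strip().partition(":")
    entries = {}
    for item in rest.split(","):
        dict_code, _, number = item.partition(";")
        entries[dict_code] = int(number)
    return headword, entries


headword, entries = parse_index_line("kiMcid:MW;50367,PW;27806,PWG;70717,SCH;10618")
print(headword, sorted(entries))
```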

@Andhrabharati

Andhrabharati commented Jan 27, 2021

@gasyoun
Are you trying for some sort of AI (artificial Intelligence, with self-learning/building patterns!) in the search process?

Coming to the subject matter of Tamilish Sanskrit, this is kind of a better style, I should say. I have seen far worse texts, even more unimaginable than the spellings of original DLI titles (now I see that various language-wise teams are working on cleaning those titles).

And finally, why am I addressed in this?!!

@gasyoun
Member

gasyoun commented Jan 27, 2021

Are you trying for some sort of AI (artificial Intelligence, with self-learning/building patterns!) in the search process?

non-AI, rule based.

Coming to the subject matter of Tamilish Sanskrit, this is kind of a better style, I should say.

Oh, ok.

I have seen far worse texts

Can you show me a sample?

even more unimaginable than the spellings of original DLI titles

Let's document the worst ones?

And finally, why am I addressed in this?!!

You might have some samples I have missed above.

@Andhrabharati

11. tamilish version of Sanskrit `kadachid`, I do not get `kadācid` as planned, because it's not a base in any of the dictionaries.
    Compare:
    `kaccid:BEN;2788,CAE;6847,CCS;4222,MD;5341,MW72;13322,MW;41751,PW;23308,PWG;14284,STC;7514`
    `kiMcid:MW;50367,PW;27806,PWG;70717,SCH;10618`

The reason for not finding kadācid is that it's not a single word by grammatical rules (though in print, many books join the two words "कदा चित्" together).

Look at this in MW, for example-

कदा चित्, at some time or other, sometimes, once [ID=42894.45]

So are the words like "कदा चन"-

न कदा चन, never at any time, RV. ; AV. &c. [ID=42894.4]

@gasyoun
Member

gasyoun commented Jan 28, 2021

kadācid is not a single word by grammatical rules

I do know that. But many people still look for it as a single word. I would want to have it as an entry point.

@Andhrabharati

The only way to do this is to ignore the spaces in the "texts" to get such entries (and that was the way the manuscript texts were, before the punctuation system [space, quote marks, exclamation & question marks, comma, ... ... ...] got introduced in Indian texts).
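
A rough sketch of that idea: index each phrase under its spelling with the spaces removed, so a query typed as one word can still find it. The phrases are IAST transliterations of the two MW items quoted above; the function name is hypothetical. Note that the query kadācid (with final d) would still not match kadācit without some separate t/d equivalence.

```python
from collections import defaultdict


def build_spaceless_index(phrases):
    """Map each phrase with its spaces removed back to the original phrase(s)."""
    index = defaultdict(list)
    for phrase in phrases:
        index[phrase.replace(" ", "")].append(phrase)
    return index


index = build_spaceless_index(["kadā cit", "na kadā cana"])
print(index["kadācit"])
```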

@gasyoun
Member

gasyoun commented Jan 28, 2021

and that was the way the manuscript texts were, before the punctuation system

Not only Indian; the same was true of Latin until the Middle Ages.

funderburkjim added a commit that referenced this issue Feb 6, 2021
@funderburkjim
Contributor Author

non-default spelling results limited on match

When using non-default input spelling, if the given spelling is found,
then the alternates are NOT shown. For example, azva with HK input spelling:
image

When using default-input spelling, all the dictionary matches shown:
image

This change makes semantic sense to me. What do others think?
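
The rule described in this comment can be sketched as follows, assuming the query has already been converted to the headword spelling (SLP1 here) and that its alternate spellings are computed elsewhere. The function name, its arguments, and the tiny headword set are illustrative, not taken from the actual code.

```python
def select_results(key, candidates, headwords, input_mode="default"):
    """Pick which spellings to show for a query.

    key:        the query, already converted to the headword spelling
    candidates: the query plus its alternate spellings (hypothetical input)
    headwords:  set of headwords in the chosen dictionary
    """
    # Non-default input with an exact hit: show only that headword.
    if input_mode != "default" and key in headwords:
        return [key]
    # Otherwise show every candidate that is a headword.
    return [c for c in candidates if c in headwords]


headwords = {"aSva", "ASva"}          # illustrative subset, SLP1 spelling
candidates = ["aSva", "ASva", "asva"]
print(select_results("aSva", candidates, headwords, "hk"))
print(select_results("aSva", candidates, headwords, "default"))
```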

@funderburkjim
Contributor Author

tamilish alternates

Based on examples above:

These are 'solved':

  • poojana -> pūjana
  • pooja -> pūjā
  • dourbhagya -> daurbhāgya
  • dhoorikrutha -> dūrīkṛta
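
To illustrate how such 'tamilish' spellings could be matched, here is a sketch that reduces both the query and each headword to a rough "skeleton" (lower-case, diacritics stripped, a few digraphs collapsed) and compares the skeletons. This is an assumed approach for illustration only; the digraph list and function names are hypothetical, and the thread does not describe how simplet actually solves these cases.

```python
import unicodedata

# Hypothetical digraph reductions, applied in order to query and headword alike.
DIGRAPHS = [("oo", "u"), ("ee", "i"), ("ou", "au"),
            ("th", "t"), ("dh", "d"), ("ri", "r"), ("ru", "r")]


def skeleton(word):
    """Lower-case, strip diacritics, and collapse the digraphs above."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    for src, dst in DIGRAPHS:
        plain = plain.replace(src, dst)
    return plain


def build_index(headwords):
    """Group headwords by skeleton."""
    index = {}
    for hw in headwords:
        index.setdefault(skeleton(hw), []).append(hw)
    return index


def lookup(query, index):
    return index.get(skeleton(query), [])


INDEX = build_index(["pūjana", "pūjā", "daurbhāgya", "dūrīkṛta"])
for q in ["poojana", "pooja", "dourbhagya", "dhoorikrutha"]:
    print(q, "->", lookup(q, INDEX))
```

Because the reduction is lossy, distinct headwords can share a skeleton, which is one way over-generation of the kind reported elsewhere in this thread can arise.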

These are mentioned in comment, but not believed to be problems:

  • vipathitha -> vipatita vipāṭita vipattita is not in MW (or any other current dictionary)
  • natha still does not give natA (natA not in any dictionary) -- gives words starting with 'm'.
  • namaḥ still no verb. Looking for nam? And, namaḥ is not a verb form AFAIK.
  • varahsini no results. Note vārāśini not in mw or any other dictionary

Still no matches:

  • kadachid no results.
    • Phrases as well as 'very common inflected forms' should yield results. How to do such enhancement not clear.
      

cut/paste good results

I've got good results with a small test of cut/paste of words from Wikipedia.
Capitalization no longer a problem.

unwanted substitutions

There are still sometimes too many results, as with 'natha':

16 results: mātṛ mata nātha mātā naṭa nata maṭha naṭā nāṭa maṭa matha mathan mathā māṭha māta mātha

Allowing initial 'n' to be replaced by 'm' is the main culprit in this example.
With current program design, solution to this not obvious.

many skd spelling differences now resolved.

skd usually (always?) shows the nominative singular for substantive headwords.
Many (all?) of these are now handled.
Examples:

  • search mw for kartri: get kartṛ (as expected, and some others)
  • search skd for kartri: get kāritā kartra karttā kartrī (was expecting karttā in skd)
  • search skd for brahman (expecting brahmā)
    • 25 results: paramaḥ parama prāṇaḥ pramāṇaṃ vraṇa vraṇaḥ bhramaḥ bhrama pramā praṇāmaḥ bharaṇaḥ bharaṇaṃ bhramaṇaṃ varaṇaṃ varaṇaḥ prāṇanaṃ brahma varaṇā prāṇā braṇa varāṇaḥ vraṇahaḥ bhraṇa brahmā praṇaḥ
    • got result, but lots of 'p' 'bh' and 'v' words also. Should these be removed from
      results?
  • brahman in mw: 28 results: brahman parama prāṇa pramāṇa praṇam vraṇa bhrama pramā praṇāma bharaṇa bhramaṇa varaṇa prāṇana brahma bhrāmaṇa varaṇā bhrāma praṇa vraṇaha vrahman praman vraṇana paramam paraṇa parāṇa varāṇa bharama vrāṇa

@gasyoun
Member

gasyoun commented Feb 6, 2021

This change makes semantic sense to me. What do others think?

Agree. As an additional option it makes sense - when you know what you actually search for.

@funderburkjim
Contributor Author

search time difference

There is a noticeable difference in search time between local machine and cologne.
Local machine (e.g. for brahman in mw) is almost instantaneous (< 1 second).
Cologne is about 8 seconds.

This is probably a combination of:

  • php 7.3.26 at Cologne, vs. php 8.0.0 on local machine
  • ssd differences

@gasyoun
Member

gasyoun commented Feb 8, 2021

is almost instantaneous (< 1 second). Cologne is about 8 seconds.

And that is even after the ngrams are turned off? Cologne seems really slow on this.

brahman in mw: 28 results

Now we have a problem of over-generation.

dhoorikrutha -> dūrīkṛta

A dream come true. Thanks, @funderburkjim

@funderburkjim
Contributor Author

p/b removed.

Removed this spelling equivalence in simplet. Now brahman (default/mw) gives 16 results:

brahman vraṇa bhrama bhramaṇa bharaṇa varaṇa brahma bhrāmaṇa varaṇā bhrāma vraṇana 
vraṇaha vrāṇa vrahman varāṇa bharama
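
As a quick consistency check on the two result lists quoted in this thread: dropping the words that begin with 'p' from the earlier 28 results for brahman in mw leaves exactly the 16 results listed here. The snippet only inspects the first letter of each result; it is not a model of how simplet implements the equivalence.

```python
# The 28 results reported earlier for brahman in mw.
BEFORE = ("brahman parama prāṇa pramāṇa praṇam vraṇa bhrama pramā praṇāma "
          "bharaṇa bhramaṇa varaṇa prāṇana brahma bhrāmaṇa varaṇā bhrāma "
          "praṇa vraṇaha vrahman praman vraṇana paramam paraṇa parāṇa "
          "varāṇa bharama vrāṇa").split()

# The 16 results reported after the p/b equivalence was removed.
AFTER = ("brahman vraṇa bhrama bhramaṇa bharaṇa varaṇa brahma bhrāmaṇa "
         "varaṇā bhrāma vraṇana vraṇaha vrāṇa vrahman varāṇa bharama").split()

kept = [w for w in BEFORE if not w.startswith("p")]
print(len(BEFORE), len(kept), set(kept) == set(AFTER))
```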

Still 8+ seconds at Cologne (are you also seeing slow times in Cologne search?)

When using prior version (/simple), same search for brahman takes about 2 seconds, and
gives 3 results: brahman brāhmaṇa vrahman

With simple and simplet search engines it is hard to know the 'cause' of the differences.

Current comparisons between the prior version (simple) and dev version (simplet)

  • simple is faster at Cologne
  • simple interprets 3rd parameter (if present) as 'output spelling'
  • simplet interprets 3rd parameter (if present) as 'input_simple' spelling assumption
    • simplet handles 'non-default' spellings such as simplet/mw/azva/hk
  • Capitalization: simplet (with input_simple = default) is better at cut-paste, especially with capitalization
    • EXAMPLE: simple LAKSHMI (no results in mw); simplet LAKSHMI -> lakṣmī lakṣmi
  • simple has better precision than simplet in some cases (like brahman above).

@gasyoun As SEO expert, how do you think we should proceed?
Should we make 'simplet' the current production version of simple-search?
If not, what needs to be done to simplet to get it ready for production?

@gasyoun
Member

gasyoun commented Feb 9, 2021

Still 8+ seconds at Cologne (are you also seeing slow times in Cologne search?)

No, it did not look like 8 to me; it was quicker, close to 3, like your experience with simple. Can we show the time in seconds for the SIMPLET queries, so one can compare?
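
Reporting the elapsed time per query could be sketched like this. The site itself runs on PHP, so this Python wrapper is only an illustration of the idea, and `timed` is a hypothetical helper name.

```python
import time


def timed(search_fn, *args, **kwargs):
    """Run search_fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = search_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in for a real search call: sorting a short list.
result, elapsed = timed(sorted, ["vraṇa", "brahman", "bhrama"])
print(f"{len(result)} results in {elapsed:.3f} seconds")
```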

Should we make 'simplet' the current production version of simple-search?

I see no reason why not. Speeding it up might take longer than expected.

If not, what needs to be done to simplet to get it ready for production?

Not only production - it's ready to go outside the simple folder, Jim.

@funderburkjim
Contributor Author

Change .htaccess:

/simple/ now goes to version 1.1 (formerly called /simplet/)

/simple1.0/ now goes to version 1.0 (formerly called /simple/).

outside the simple folder

What does that mean?

@gasyoun
Member

gasyoun commented Feb 10, 2021

What does that mean?

  1. add to all display dropdowns
  2. make default

rggrgret

@funderburkjim
Contributor Author

While trivial to add a simple option to basicdisplay's input menu,
the implementation of the functionality is another matter.

It may be better to think of the whole system of basic, list, advanced-search displays as legacy applications, which will remain as is for the foreseeable future.

Indeed with simple-search, there is no need for basic or list, in my opinion.

However, basic, list, etc. do have the advantage that they can be easily installed as local applications.
By contrast, simple-search requires a lot more resources to do its job. Simple-search can be
installed locally but it's a much bigger commitment. When docker engines are as easy to
install as xampp, then we can replace the current local implementations of basic, etc. with
docker containers, and also perhaps have more flexibility to revise basic, etc. to include simple-search capabilities.

On the other side, Advanced search has some unique features that simple-search lacks:

  • substring matching for headwords
  • full-text searching (also has substring matching).

So I think I get your idea, but that it is premature to spend much time thinking about it.

@gasyoun
Member

gasyoun commented Feb 15, 2021

Indeed with simple-search, there is no need for basic or list, in my opinion.

Agree. But adding it will hurt in no way. A big remake is big. Let's do the trivial.

Advanced search has some unique features that simple-search lacks

I do not see why these options can't be implemented in simple search.

3 participants