MW accent correction #141

Closed
funderburkjim opened this issue Sep 19, 2022 · 50 comments

@funderburkjim
Contributor

In #140, it was mentioned that there are many errors in the coding of accents in the CDSL version of MW.
This issue is devoted to correcting these errors.

It is reasonable to restrict to headwords.
The 'k2' (key2) field in the metaline shows accents.

107802 matches for "<k2>.*/" in buffer: mw.txt  udAtta accents
114 matches for "<k2>.*\^" in buffer: mw.txt  svarita accents

In pwg,
470 matches for "<k2>.*\^" in buffer: pwg.txt
20809 matches for "<k2>.*/" in buffer: pwg.txt

In pw:
17929 matches for "<k2>.*/" in buffer: pw.txt
293 matches for "<k2>.*\^" in buffer: pw.txt

We can assume there should be consistency in accent
between MW and the Boehtlingk dictionaries (PW, PWG).

A reasonable first step might be to look at the svarita accents.
For instance:

pw: <L>12716<pc>1151-1<k1>asurya<k2>asurya^<e>100
mw: <L>21088<pc>121,2<k1>asurya<k2>asurya^<h>1<e>2

We could do such a comparison by program and print out the exceptions
for hand examination.
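
A minimal sketch of such a comparison, assuming the metaline layout seen in the examples of this thread (`<L>…<pc>…<k1>…<k2>…`, optionally followed by `<h>`/`<e>`). The function names are hypothetical and this is not the script actually used; the sample lines are the asurya metalines above plus the SapaTya metalines quoted later in this thread. Hyphens and compound dashes are dropped before comparing, which is a simplifying assumption.

```python
import re
from collections import defaultdict

# Metaline fields up to and including <k2>; anything after it (<h>, <e>) is ignored.
METALINE = re.compile(r"^<L>([^<]+)<pc>([^<]+)<k1>([^<]+)<k2>([^<]+)")

def accented_k2_by_k1(lines):
    """Map each k1 to the set of accented k2 spellings found for it."""
    found = defaultdict(set)
    for line in lines:
        m = METALINE.match(line)
        if m and re.search(r"[/^]", m.group(4)):
            # Drop '-' and the compound dash (U+2014) so only accent placement is compared.
            found[m.group(3)].add(re.sub("[-\u2014]", "", m.group(4)))
    return found

def accent_exceptions(mw_lines, ref_lines):
    """Yield (k1, mw_forms, ref_forms) when both sides are accented but share no form."""
    mw = accented_k2_by_k1(mw_lines)
    ref = accented_k2_by_k1(ref_lines)
    for k1 in sorted(mw.keys() & ref.keys()):
        if not (mw[k1] & ref[k1]):
            yield k1, sorted(mw[k1]), sorted(ref[k1])

mw_sample = [
    "<L>21088<pc>121,2<k1>asurya<k2>asurya^<h>1<e>2",
    "<L>212560<pc>1052,1<k1>SapaTya<k2>SapaTya/<e>2",
]
ref_sample = [
    "<L>12716<pc>1151-1<k1>asurya<k2>asurya^<e>100",   # pw
    "<L>97768<pc>7-0062<k1>SapaTya<k2>SapaTya^",        # pwg
]
exceptions = list(accent_exceptions(mw_sample, ref_sample))
for k1, mw_forms, ref_forms in exceptions:
    print(k1, mw_forms, ref_forms, sep="\t")
```

Headwords present on only one side, or unaccented on one side, are deliberately skipped here; they would need a separate report.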

@funderburkjim
Contributor Author

See #137 (comment) for another approach to detecting accent problems.

@drdhaval2785
Contributor

Kindly look at the following entry SapaTya in both PWG and MW

PWG

<L>97768<pc>7-0062<k1>SapaTya<k2>SapaTya^
{#SapaTya^#}¦ (wie eben) <lex>adj.</lex> {%auf Fluch beruhend%} 
<ls>ṚV. 10, 97, 16.</ls>
<LEND>

PWG display

शपथ्य [Printed book page [7-0062](https://www.sanskrit-lexicon.uni-koeln.de/scans/csl-apidev/servepdf.php?dict=PWG&page=7-0062)]
शपथ्य॑ (wie eben) adj. auf Fluch beruhend [Ṛv. 10, 97, 16.](https://sanskrit-lexicon.github.io/rvlinks/rvhymns/rv10.097.html#rv10.097.16)                  [ID=97768]

MW

<L>212560<pc>1052,1<k1>SapaTya<k2>SapaTya/<e>2
<s>SapaTya/</s> ¦ <lex>mfn.</lex> depending on a curse, (a sin) consisting in cursing or imprecation, <ls>RV.</ls><info lex="m:f:n"/>
<LEND>

MW display


(H2) [Printed book page [1052](https://www.sanskrit-lexicon.uni-koeln.de/scans/csl-apidev/servepdf.php?dict=MW&page=1052),1]
शपथ्य॑ mfn. depending on a curse, (a sin) consisting in cursing or imprecation, RV.  [ID=212560]

Note the PWG SapaTya^ versus MW SapaTya/
SapaTya ends with 'a' in svarita.
This sure is confusing. For dictionaries PWG and MW, we should have consistent SLP1 encoding.

@drdhaval2785
Contributor

MW typo examples

MW data

<L>212560<pc>1052,1<k1>SapaTya<k2>SapaTya/<e>2
<s>SapaTya/</s> ¦ <lex>mfn.</lex> depending on a curse, (a sin) consisting in cursing or imprecation, <ls>RV.</ls><info lex="m:f:n"/>
<LEND>
<L>212561<pc>1052,1<k1>Sapana<k2>Sa/pana<e>2
<s>Sa/pana</s> ¦ <lex>n.</lex> a curse, imprecation, <ls>AV.</ls><info lex="n"/>
<LEND>

MW snippet

Screenshot_2022-09-20_11-18-57

MW marks both with '/', although the two accents are different.

PWG data

<L>97768<pc>7-0062<k1>SapaTya<k2>SapaTya^
{#SapaTya^#}¦ (wie eben) <lex>adj.</lex> {%auf Fluch beruhend%} 
<ls>ṚV. 10, 97, 16.</ls>
<LEND>

<L>97769<pc>7-0062<k1>Sapana<k2>Sa/pana
{#Sa/pana#}¦ (von {#Sap#}) <lex>n.</lex> = {#SapaTa#} 
<ls>AK. 1, 1, 5, 10.</ls> 
<ls>H. 262.</ls> {%Fluch%} 
<ls>TRIK. 3, 2, 9.</ls> 
<ls>AV. 1, 28, 3.</ls>
<LEND>

PWG snippet

Screenshot_2022-09-20_11-20-28

PWG shows the two with different accents.

@Andhrabharati
Contributor

Andhrabharati commented Sep 20, 2022

Good to see that @drdhaval2785 is also finding the 'inconsistency' issues across the dictionaries' data now.

A standing rule to be followed would be that, whatever the internal representation of the text (file) data is, the end result (display or otherwise) should tally with the printed matter, whether it is in the dictionary itself or in the reference work that it is citing from.

[I would have recommended Dhaval to post the citation matter from the RV and AV as well (as the case may be), to make the argument further strong/appealing. Ultimately they are the ones that are to be referred to, the dictionaries or others are just helping to reach them.]

@drdhaval2785
Contributor

https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log.tsv - The highest priority accent differences
Entries are in headword AccentInMW AccentInPWG format

https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log.html - HTML can be downloaded and checked manually if needed.

@drdhaval2785
Contributor

Once this is done, we can go to the next step.
That is because of the compound issues.

TSV file - https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log_with_compounds.tsv
HTML file - https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log_with_compounds.html

examples

aMSaBU a/MSaBU/ aMSaBU/

Here, the headword is a/MSa.
When it is used in compound, because of rules governing accent to compounds, it becomes aMSaBU/.
PWG correctly captures this.

As the compound parsing was done through some program in MW, the accent portion of it was not properly handled or could not be properly handled.
Therefore, it gave rise to a/MSaBU/ instead of aMSaBU/

We need to convert it back to aMSaBU/ as per PWG.
For headwords which occur in both MW and PWG, it may be an easy programmatic solution, but not sure how to handle the cases where there is no matching headword in PWG.

My hunch is that we should keep the accent as per the second part of the compound, and ignore the accent of first part of compound. Not sure.

@Andhrabharati
Contributor

Andhrabharati commented Sep 20, 2022

My hunch is that we should keep the accent as per the second part of the compound, and ignore the accent of first part of compound. Not sure.

In general this is the principle to be followed, @drdhaval2785 ; in cases where the accent needs to be retained on the first part, the print has invariably mentioned it just before its (entry word's) lexical info (gender or otherwise).

I see that quite many of those portions are missing in the MW digitisation, though occasionally present scattered (just like the nom. case endings that I was talking about all these days).

@Andhrabharati
Contributor

See, @gasyoun: Dhaval has now come out with two lists (499 + 3169), amounting to about 3600 entries, corroborating my estimate of more than a couple of thousand as posted at #140 (comment).

[I am quite sure there still would be more entries in the text, that need to be identified and corrected.]

@drdhaval2785
Contributor

Slight correction.
499 is a subset of 3169, and not in addition thereto. So total 3169 differences.
Quite sizeable.

@Andhrabharati
Contributor

Andhrabharati commented Sep 20, 2022

For headwords which occur in both MW and PWG, it may be an easy programmatic solution, but not sure how to handle the cases where there is no matching headword in PWG.

First option is to look for those in pwk, @drdhaval2785, and then in the pwkvn and also the "leftout" VN portions of PWG volumes 1-4 and 6. There are quite many in those pages, that did not come in PWG VN pages of Vol. 5 and Vol.7
(Jim was thinking the case to be otherwise with some random checks; I had checked all those entries and found that Jim was wrong, but did not pursue the matter with him! Much against my nature, to see the matter to reach its 'proper' end!!!)

@drdhaval2785
Contributor

On a cursory look at https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log.tsv, it seems that a large chunk of it ends with ya. I am not sure whether there is some programmatic oddity which gave rise to this, or there is some grammatical rule which allows optional accents with words ending in suffixes ending with ya.
Just noting it here, so that some grammatically inclined person can have a look.

@drdhaval2785
Contributor

First option is to look for those in pwk, @drdhaval2785 (and then in the pwkvn and also the "leftout" VN portions of PWG volumes 1-4.

Seems a reasonable way.
So hierarchy is PWG -> PWK -> PWKVN -> PWGVN
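
A small sketch of this lookup order, assuming each source has already been reduced to a mapping from k1 to k2. The function name is hypothetical, the two VN tables are left empty as placeholders, and the sample entries are taken from metalines quoted in this thread.

```python
def first_accented(k1, sources):
    """Return (source_name, k2) from the first source with an accented k2 for k1."""
    for name, table in sources:
        k2 = table.get(k1)
        if k2 and ("/" in k2 or "^" in k2):
            return name, k2
    return None  # no source gives an accent; needs manual review

# Order follows the hierarchy PWG -> PWK -> PWKVN -> PWGVN.
sources = [
    ("PWG", {"SapaTya": "SapaTya^", "aMhu": "aMhu/"}),
    ("PWK", {"asurya": "asurya^", "aMhu": "aMhu/"}),
    ("PWKVN", {}),   # placeholder: additions and corrections to pwk
    ("PWGVN", {}),   # placeholder: additions and corrections to PWG
]

print(first_accented("aMhu", sources))
print(first_accented("asurya", sources))
```

A real k1 can carry several k2 values (homonyms), so a mapping to a list of forms would be closer to the data than the single-value tables assumed here.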
Pardon my ignorance about PWKVN and PWGVN. Have not seen them at all.

@Andhrabharati
Contributor

Andhrabharati commented Sep 20, 2022

Pardon my ignorance about PWKVN and PWGVN. Have not seen them at all.

These VN pages are the Additions and Corrections to pwk and PWG volumes [printed at the end of respective volume or in the later volume(s)].

pwkvn (of all the 7 volumes) is hosted under a separate repo, as Jim has his own reasons not to club it with the pwk text, as in the case of every other CDSL work. I had proposed to him once to combine them and then left the matter.

After my pointing out the matter being missed altogether, there were some trials to derive the pwkvn data from SCH data, but finally it was decided to completely get those pages retyped. Jim seems to have funded (James Funderburk > Fund; hope Jim does not mind my saying thus) the digitisation expenses, as per Thomas.

PWG VN portions of Vol. 5 and Vol. 7 are after the PWG main portions of those two volumes respectively.
The other volumes' VN data is lying in some old version of PWG, which surfaced while I was digging through the old folders, and I had even posted the data completely 'proofed'; it amounts to some 1000+ entries/lines.

@Andhrabharati
Contributor

Andhrabharati commented Sep 20, 2022

Slight correction. 499 is a subset of 3169, and not in addition thereto. So total 3169 differences. Quite sizeable.

I did not look at the items at all in the two lists, just seen the numbers.

Thought they would have been different, looking at this line--

Once this is done, we can go to the next step.

[I did not expect that the entries would have been repeated in another list, once 'done' in a list, as indicated in the above statement.]

@vvasuki

vvasuki commented Sep 20, 2022

Good to see that @drdhaval2785 is also finding the 'inconsistency' issues across the dictionaries' data now.

A standing rule to be followed would be that, whatever the internal representation of the text (file) data is, the end result (display or otherwise) should tally with the printed matter, whether it is in the dictionary itself or in the reference work that it is citing from.

But, it should not be the only "end result", or there would be no question of devanAgarI headwords for MW etc.. It is desirable, as mentioned elsewhere (sanskrit-lexicon/csl-ldev#7 (comment)) to additionally (and prominently) show the accent in a standardized format.

On a cursory look at https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log.tsv, it seems that a large chunk of it ends with ya. I am not sure whether there is some programmatic oddity which gave rise to this, or there is some grammatical rule which allows optional accents with words ending in suffixes ending with ya. Just noting it here, so that some grammatically inclined person can have a look.

@drdhaval2785 This is a matter of jAtya-svarita, that is, a svarita arising from internal sandhi and not as a consequence of following an udAtta. In such words, instead of an udAtta setting the tone, you have a svarita. It occurs only after ya or va. For example, in the case of shapathya, Bohtlingk-Sanskrit-Worterbuch-in-kurzerer-Fassung deciphers it as शपथि꣫अ . So, it should be easy to detect programmatically.
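
One way such a check could look, as a sketch only: it flags a pair when the MW k2 and the reference k2 differ in exactly one character, MW having '/' (udAtta) where the reference has '^' (svarita), on a vowel that directly follows y or v. The function name and the SLP1 vowel list are assumptions of this sketch; the sample forms are from the metalines quoted in this thread.

```python
SLP1_VOWELS = set("aAiIuUfFxXeEoO")

def svarita_candidate(mw_k2, ref_k2):
    """True if MW has '/' where the reference has '^', on a vowel right after y or v."""
    if len(mw_k2) != len(ref_k2):
        return False
    diffs = [i for i, (a, b) in enumerate(zip(mw_k2, ref_k2)) if a != b]
    if len(diffs) != 1:
        return False
    i = diffs[0]
    return (
        mw_k2[i] == "/" and ref_k2[i] == "^"
        and i >= 2
        and mw_k2[i - 1] in SLP1_VOWELS   # the accented vowel
        and mw_k2[i - 2] in "yv"          # preceded by y or v
    )

print(svarita_candidate("SapaTya/", "SapaTya^"))
print(svarita_candidate("Sa/pana", "Sa/pana"))
```

A hit only marks the entry for comparison with the print; it does not by itself show which dictionary is right.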

@drdhaval2785
Contributor

Good point raised @vvasuki . Thanks.

@vvasuki

vvasuki commented Sep 20, 2022

Ultimately they are the ones that are to be referred to, the dictionaries or others are just helping to reach them.

An important point of clarification regarding the above. RV, SV, AV etc.. are NOT the only source of svara-s (as the bhAShyakAra says - it's impossible to list all sAdhu-shabda-s) - we have vyAkaraNa to deduce svara-s (which are incidentally a must in truly "proper" laukika speech as per shAstra). So, it is not inconceivable that the dict makers would have used the very same sUtras that we refer to today when determining svara-s in some cases.

@Andhrabharati
Contributor

Andhrabharati commented Sep 20, 2022

For headwords which occur in both MW and PWG, it may be an easy programmatic solution, but not sure how to handle the cases where there is no matching headword in PWG.

First option is to look for those in pwk, @drdhaval2785, and then in the pwkvn and also the "leftout" VN portions of PWG volumes 1-4 and 6.

@funderburkjim
would you mind writing another "comparative display" program, to show MW | PWG | pwk + pwkvn (no need of having SCH in this case) in one screen, similar to https://sanskrit-lexicon.uni-koeln.de/scans/csl-apidev/pwkvn/03/?

[I was thinking of asking you this for many days now, but waiting for a suitable time.]

@gasyoun
Member

gasyoun commented Sep 20, 2022

We can assume there should be consistency in accent between MW and the Boehtlingk dictionaries (PW, PWG).

Not always so, as there are ERRATA that have not been implemented.

and then in the pwkvn and also the "leftout" VN portions of PWG volumes 1-4 and 6.

That addresses the errata portion.

My hunch is that we should keep the accent as per the second part of the compound, and ignore the accent of first part of compound. Not sure.

And it is stated by Dhaval, who worked on accent issues programmatically years ago.

I see that quite many of those portions are missing in the MW digitisation, though occasionally present scattered (just like the nom. case endings that I was talking about all these days).

A single sample?

[I am quite sure there still would be more entries in the text, that need to be identified and corrected.]

We do not care about the text yet, only headwords.

bhAShyakAra says - it's impossible to list all sAdhu-shabda-s

Can you trace the statement, please?

So, it is not inconceivable that the dict makers would have used the very same sUtras that we refer to today when determining svara-s in some cases.

They used only Vedic svaras; they did not determine anything.

@vvasuki

vvasuki commented Sep 21, 2022

So, it is not inconceivable that the dict makers would have used the very same sUtras that we refer to today when determining svara-s in some cases.

They used only Vedic svaras; they did not determine anything.

How are you so sure? Whitney deals with svara-s quite well in his grammar. They would be dumb to not have used simple rules which they would have doubtless encountered via sAyaNa's commentary and native informants.

My hunch is that we should keep the accent as per the second part of the compound, and ignore the accent of first part of compound. Not sure.

And it is stated by Dhaval, who worked on accent issues programmatically years ago.

Accent in compounds is not a "one rule for all" thing, if that's what you're talking about. Major rules are summarized here .

Best to show the accents of parts separately in case accent for the whole cannot be determined by lookup.

bhAShyakAra says - it's impossible to list all sAdhu-shabda-s

Can you trace the statement, please?

Vaguely recalled from paspashAhnika of mahAbhAShya, but could not find it exactly now. Just found this, which implies that - लघ्वर्थं चाध्येयं व्याकरणम् । 'ब्राह्मणेनावश्यं शब्दा ज्ञेयाः' इति, न चान्तरेण व्याकरणं लघुनोपायेन शब्दाः शक्या विज्ञातुम् ॥ . So, might be in kaiyaTa's comment.

@vvasuki

vvasuki commented Sep 21, 2022

Can you trace the statement, please?

Vaguely recalled from paspashAhnika of mahAbhAShya, but could not find it exactly now. Just found this, which implies that - लघ्वर्थं चाध्येयं व्याकरणम् । 'ब्राह्मणेनावश्यं शब्दा ज्ञेयाः' इति, न चान्तरेण व्याकरणं लघुनोपायेन शब्दाः शक्या विज्ञातुम् ॥ . So, might be in kaiyaTa's comment.

Found it -

अथैतस्मिञ् शब्दोपदेशे सति किं शब्दानां प्रतिपत्तौ प्रतिपदपाठः कर्तव्यः - गौरश्वः पुरुषो हस्ती शकुनिर् मृगो ब्राह्मण इत्येवमादयः शब्दाः पठितव्याः ?
नेत्याह । अनभ्युपाय एष शब्दानां प्रतिपत्तौ प्रतिपदपाठः ॥ एवं हि श्रूयते - 'बृहस्पतिर् इन्द्राय दिव्यं वर्षसहस्रं प्रतिपदोक्तानां शब्दानां शब्दपारायणं प्रोवाच नान्तं जगाम' ॥ बृहस्पतिश्च प्रवक्ता, इन्द्रश्चाध्येता, दिव्यं वर्षसहस्रमध्ययनकालः, न चान्तं जगाम । किं पुनरद्यत्वे ? यः सर्वथा चिरं जीवति - वर्षशतं जीवति । चतुर्भिश् च प्रकारैर् विद्योपयुक्ता भवति - आगम-कालेन, स्वाध्याय-कालेन, प्रवचन-कालेन, व्यवहार-कालेनेति । तत्र चास्यागमकालेनैवायुः पर्युपयुक्तं स्यात् । तस्माद् अनभ्युपायः शब्दानां प्रतिपत्तौ प्रतिपदपाठः॥+++(4)+++
कथं तर्हीमे शब्दाः प्रतिपत्तव्याः? किंचित् सामान्य-विशेषवल्-लक्षणं प्रवर्त्यम् । येनाल्पेन यत्नेन महतो महतः शब्दौघान् प्रतिपद्येरन् ॥ किं पुनस् तत् ? उत्सर्गापवादौ । कश्चिदुत्सर्गः कर्तव्यः, कश्चिदपवादः ॥

@Andhrabharati
Contributor

Andhrabharati commented Sep 21, 2022

For headwords which occur in both MW and PWG, it may be an easy programmatic solution, but not sure how to handle the cases where there is no matching headword in PWG.

I just recalled why I told @funderburkjim, in the other (parent) issue, that I have no great expectations of a programmatic approach.

There are quite some entries that were to be corrected for accents in both PWG and pwk (& pwkvn).

So even if the metalines' k2 entries are compared between MW and these, those VN forms still remain uncaught, for those are lying in the body portion still, and not carried into the HW portion yet.

It was with my intervention that this correction has happened in just MW (last year), from its annexure data.

@drdhaval2785 and @funderburkjim may think of getting some means to cover this point in a programmatic way.

[I have some other points at the back of my mind, and would post subsequently at some time.]

@Andhrabharati
Contributor

Andhrabharati commented Sep 21, 2022

A single sample

I see no use of giving any example, @gasyoun !!

I don't have to prove here that I have looked into the print (pdf) and text (cdsl file) data close enough.

@Andhrabharati
Contributor

Andhrabharati commented Sep 22, 2022

Just opened the two log files by @drdhaval2785 , and noticed that neither of them contain the 'aMhu' entry.

MW has it as two parts
<L>126<pc>1,2<k1>aMhu<k2>aMhu<e>2
<L>127<pc>1,2<k1>aMhu<k2>aMhu<e>2B

Incidentally the 2nd part <L>127 is to be with the acute accent as per MW, not the 1st part <L>126 (which is with udAtta accent in PWG and pwk)
image

Also for future ref. (if it ever happens!), note the accent difference in the cross-referred word, <s>paro/-'Mhu</s> in MW (<L>126 body portion) as against {#paroMhu#} in PWG VN (<L>62430 body portion); and CDSL MW text has this word marked with acute mark (as compared to the print having the grave mark; PWG suggests no accent at all).

Though MW99 has picked up much of its data from Boehtlingk's dictionaries, undoubtedly it did take help from many other sources and also has some independent work (I would estimate the ratio as 75:25 roughly, for the above two portions); thus, we cannot always take Boehtlingk as the ultimate authority!

PWG has-- <L>55<pc>1-0007<k1>aMhu<k2>aMhu/
image

pwk has-- <L>54<pc>1001-3<k1>aMhu<k2>aMhu/<e>100
image

Leaving the actual differences between MW and PWG/pwk accents (as above) aside, the main point is that, the 'logic' used in identifying the differences programmatically needs some 'refining'.

@Andhrabharati
Contributor

Andhrabharati commented Sep 22, 2022

A single sample

I see no use of giving any example, @gasyoun !!

I don't have to prove here that I have looked into the print (pdf) and text (cdsl file) data close enough.

@gasyoun
Just not to leave your 'wish' unfulfilled, here is a case showing a correct accent and a wrong accent (wrt the print) in the (suggested) portions in CDSL text data (however neither of these got applied to the resp. HW!!)-

image

image

-----------------
@drdhaval2785 these are the (suggested) accents in the first part of the compound words in the MW print.

I was referring to all such cases, in my post above-- #141 (comment)

@Andhrabharati
Contributor

Andhrabharati commented Sep 22, 2022

Another (minor) discrepancy of CDSL data wrt the print can be seen in the above snippet--
while the first word 116701 has the accent info preceding the lexical info, the second word 116702 has it following the lex. info!!

The print is consistently having the accent info before the lex. info all through its pages (there might be some cases in opposite, but they would surely be rare).

@Andhrabharati
Contributor

Andhrabharati commented Sep 22, 2022

Slight correction.
499 is a subset of 3169, and not in addition thereto. So total 3169 differences.
Quite sizeable.

@drdhaval2785, @gasyoun

Do you recall that at one time (about a year back?) I was saying that my estimate was that ~2% of HWs would be needing correction, which you were thinking to be 'a bit too much' of estimation (with so much of work/cleaning done over so many years)?

Whether it is an error in spelling or in accent, after all it is an error; and keep in mind that MW wrote a special note about the accents in his introduction, citing their importance specifically in Sanskrit.

[I do not like to say this-- but as I am looking deep into the text, I am finding that more errors got introduced into the text, as against the cleaning part, esp. while tagging various entities; in summary, my feeling is that it was more of tagging that took place than cleaning the MW text.]

@Andhrabharati
Contributor

Andhrabharati commented Sep 22, 2022

[I am quite sure there still would be more entries in the text, that need to be identified and corrected.]

We do not care about the text yet, only headwords.

Even I was talking about HWs only, @gasyoun [in my words, text is the typed matter (whether it is HWs part or the rest); when I mean the meaning(s) part, I would be specifically saying body portion]!!

funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Sep 26, 2022
funderburkjim added a commit that referenced this issue Sep 26, 2022
@funderburkjim
Contributor Author

mw svarita corrections from pwg

This work was done in issue141 directory.

As mentioned above, there are many more metalines in pwg with svarita accents than in mw:

114 matches for "<k2>.*\^" in buffer: mw.txt  svarita accents
470 matches for "<k2>.*\^" in buffer: pwg.txt

The 114 in mw were compared to the printed mw, and a few corrections made. See change_mw_1.txt.
Then, for each of the 470 pwg entries, the corresponding mw entries were compared to printed mw,
and changes made. See change_mw_2.txt. There was also one typo corrected in pwg. The analysis uses variations of @drdhaval2785 's find_accent_diff.py .
After these changes to mw,

849 matches for "<k2>.*\^" in buffer: temp_mw_2.txt

svarita_mw_2.txt lists these metalines.

There are 81 additional cases (see ad2arev.txt) where pwg shows a svarita accent, but either

  • no corresponding MW headword is noted, OR
  • the corresponding MW headword has no accent. (and is not included in the 470)
    • The significance of this difference between pwg and mw is unknown (to me).

There are numerous (about 125) cases where MW has, in addition to a svarita-accented form, also an unaccented form. For example namasya:
image

An iast version of the revised (temp_mw_2.txt) mw: mw_2_svarita_iast.zip

@gasyoun
Member

gasyoun commented Sep 26, 2022

There are 81 additional case

Interesting indeed, so MW is not a pure copycat.

@funderburkjim
Contributor Author

possible next step: inheritance

71873 matches for "<k2>.*[\/^].*[-—]" in buffer: temp_mw_2.txt
For example, aMSa has an accent, and this accent is, in the CDSL coding, 'inherited' by
compounds of aMSa.

<L>10<pc>1,1<k1>aMSa<k2>a/MSa<e>1
 ...
<L>20<pc>1,1<k1>aMSakaraRa<k2>a/MSa—karaRa<e>3
<L>21<pc>1,1<k1>aMSakalpanA<k2>a/MSa—kalpanA<e>3
<L>22<pc>1,1<k1>aMSaprakalpanA<k2>a/MSa—prakalpanA<e>3
 etc.

image

I think this 'accent inheritance in compounds' principle of CDSL is likely wrong in general. For instance
<k2>a/MSa—karaRa should be changed to <k2>aMSa—karaRa (remove accent).

Should the principle be?
Always remove inherited accents in compounds unless MW specifically says to use them.
For example, the inherited accent 'sva^r' should be removed in svargiri and svarjit,
but retained in svarcakzas and svarcanas:

<L>259095<pc>1281,1<k1>svar<k2>sva^r<h>4<e>2
 ...
<L>259109<pc>1281,2<k1>svargiri<k2>sva/r—giri<e>3   to change to svar—giri
<L>259110<pc>1281,2<k1>svarcakzas<k2>sva^r—cakzas<e>3  ok
<L>259111<pc>1281,2<k1>svarcanas<k2>sva^r—canas<e>3   ok
<L>259112<pc>1281,2<k1>svarjit<k2>sva/r—ji/t<h>a<e>3    to change to svar—ji/t

image
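
A rough sketch of the proposed principle, under the assumption that the first member of a compound is everything before the first compound dash (U+2014) in k2. The function name and the `keep` flag are hypothetical; whether MW "specifically says" to keep an accent has to come from the print, so it is passed in rather than inferred.

```python
import re

SEP = "\u2014"  # the compound dash used in MW k2 values

def drop_inherited_accent(k2, keep=False):
    """Remove accent marks from the first member of a compound k2 unless keep is set."""
    if keep or SEP not in k2:
        return k2
    first, rest = k2.split(SEP, 1)
    return re.sub(r"[/^]", "", first) + SEP + rest

print(drop_inherited_accent("a/MSa" + SEP + "karaRa"))
print(drop_inherited_accent("sva/r" + SEP + "ji/t"))
print(drop_inherited_accent("sva^r" + SEP + "cakzas", keep=True))
```

Accents on later members (as in ji/t) are left untouched, since those are not inherited from the parent headword.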

@vvasuki

vvasuki commented Sep 27, 2022

71873 matches for "<k2>.*[\/^].*[-—]" in buffer: temp_mw_2.txt For example, aMSa has an accent, and this accent is, in the CDSL coding, 'inherited' by compounds of aMSa.

<L>10<pc>1,1<k1>aMSa<k2>a/MSa<e>1
 ...
<L>20<pc>1,1<k1>aMSakaraRa<k2>a/MSa—karaRa<e>3
<L>21<pc>1,1<k1>aMSakalpanA<k2>a/MSa—kalpanA<e>3
<L>22<pc>1,1<k1>aMSaprakalpanA<k2>a/MSa—prakalpanA<e>3
 etc.

image

I think this 'accent inheritance in compounds' principle of CDSL is likely wrong in general. For instance <k2>a/MSa—karaRa should be changed to <k2>aMSa—karaRa (remove accent).

In all these particular cases, accent actually would lie in the second part of the compound. For bahuvrIhi compounds (and a few other exceptions), the first constituent's accent would be retained. This is not possible to determine programmatically. So, indeed, it is a good idea to remove accent from both parts <k2>aMSa—karaRa. However, for the convenience of those who care for accents, it would be a great idea to provide independent accents for both the constituents (by adding a string like (a/MSa + ka/raRa)), so that they can work out the final accent of the compound in their heads.

Of course, the accent of the first constituent is easily available, and that of the second part may or may not be determined without ambiguity by a further lookup (eg. both करण॑ and क॑रण exist). So, the accent of the second part can be shown only in unambiguous cases.
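
A sketch of how such a hint string might be built, showing the second member's accent only when the lookup is unambiguous. The function name and the lookup table are hypothetical: the two accentuations of karaRa follow the remark above, and the entry for jit is illustrative only.

```python
def constituent_hint(first_k2, second_k1, accented_forms):
    """Return '(first + second)', accenting the second member only if exactly one form is known."""
    forms = accented_forms.get(second_k1, set())
    second = next(iter(forms)) if len(forms) == 1 else second_k1
    return f"({first_k2} + {second})"

# Hypothetical lookup table: k1 -> set of accented headword forms.
accented_forms = {
    "karaRa": {"ka/raRa", "karaRa/"},   # ambiguous, so shown without accent
    "jit": {"ji/t"},                    # illustrative single form
}

print(constituent_hint("a/MSa", "karaRa", accented_forms))
print(constituent_hint("sva^r", "jit", accented_forms))
```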

Should the principle be? Always remove inherited accents in compounds unless MW specifically says to use them. For example, the inherited accent 'sva^r' should be removed in svargiri and svarjit, but retained in svarcakzas and svarcanas:

Sounds like a good idea!

@Andhrabharati
Contributor

There are 81 additional case

Interesting indeed, so MW is not a pure copycat.

See my post above at #141 (comment)

Though MW99 has picked up much of its data from Boehtlingk's dictionaries, undoubtedly it did take help from many other sources and also has some independent work (I would estimate the ratio as 75:25 roughly, for the above two portions); thus, we cannot always take Boehtlingk as the ultimate authority!

"and also has some independent work"

@Andhrabharati
Contributor

Should the principle be?
Always remove inherited accents in compounds unless MW specifically says to use them.
For example, the inherited accent 'sva^r' should be removed in svargiri and svarjit,
but retained in svarcakzas and svarcanas:

I had posted several messages above, on the same point--

#141 (comment)

#141 (comment)

#141 (comment)

@Andhrabharati
Contributor

There are numerous (about 125) cases where MW has, in addition to a svarita-accented form, also an unaccented form.

That is how it is!
The accent would change at different contexts, and also at different 'lexical' forms.
[Sometimes, even the same lexical form could be having different accents!]

@gasyoun
Member

gasyoun commented Sep 27, 2022

it would be a great idea to provide independent accents for both the constituents (by adding a string like (a/MSa + ka/raRa)), so that they can work out the final accent of the compound in their heads.

@Andhrabharati that does not make much sense to me. To give something wrong, that one needs to recalculate in his head.

So, the accent of the second part can be shown only in unambiguous cases.

Programmatically?

75:25 roughly

Missed that one before.

@vvasuki

vvasuki commented Sep 28, 2022

it would be a great idea to provide independent accents for both the constituents (by adding a string like (a/MSa + ka/raRa)), so that they can work out the final accent of the compound in their heads.

@Andhrabharati that does not make much sense to me. To give something wrong, that one needs to recalculate in his head.

Saying Water (←H₂ + O₂) instead of H₂O water became wrong since when?

So, the accent of the second part can be shown only in unambiguous cases.

Programmatically?

Yes

75:25 roughly

Missed that one before.

funderburkjim added a commit to sanskrit-lexicon/csl-corrections that referenced this issue Oct 2, 2022
funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Oct 2, 2022
funderburkjim added a commit that referenced this issue Oct 2, 2022
@funderburkjim
Contributor Author

Phase 2

The focus here is on the MW headwords whose 'k2' differs from PWG, where PWG has an udAtta accent, and where MW has non-samAsa entries.
(i.e., similar to prior phase, except here udAtta and prior phase was svarita).

The work is still in the issue141 directory. The mw change transactions are in
change_mw_3.txt (about 600 lines changed) . Details can be seen in the commit above.

Expand k2 syntax

There are headwords where two accented variants are presented.

<L>6230<pc>32,1<k1>anugra<k2>a/n-ugra,an-ugra/<e>1    <<< NOTE THE COMMA
<s>a/n-ugra</s> or <s>an-ugra/</s> ¦ <lex>mf(<s>A</s>)n.</lex> not harsh or violent, mild, gentle, <ls>RV.</ls> &c.<info lex="m:f#A:n"/><info or="6230,anugra"/>

It was convenient to extend the metaline convention to allow a comma-delimited list for k2. See the sections singleton_or_and changes and
temp_singleton_k2changes of change_mw_3. This resolved several of the udAtta accent differences with PWG.
At this point, there were 350+ mw entries to compare with pwg (see ad3_rev.txt). The mw print was examined by hand, and the CDSL k2 markup was classified as '+' (CDSL agrees with print; 200+ cases) or 'x' (CDSL k2 markup may disagree with print; 160+ cases).
Then changes were made for the 'x' cases; see the temp_change_mw_3b.txt section of change_mw_3.txt. After all the changes, there remain about 275 cases with udAtta accents classified as differing from PWG (out of about 5000 cases). These are shown
in file ad3b_rev.txt.
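
A minimal sketch of reading the comma-delimited k2 and comparing it with a single reference form. The helper names are hypothetical, and the reference value `anugra/` below is assumed for illustration only; it is not quoted from pwg.

```python
def k2_variants(k2):
    """Split a comma-delimited k2 into its variants, dropping hyphens for comparison."""
    return [v.replace("-", "") for v in k2.split(",")]

def agrees(mw_k2, ref_k2):
    """True if any MW variant equals any reference variant."""
    ref = set(k2_variants(ref_k2))
    return any(v in ref for v in k2_variants(mw_k2))

print(k2_variants("a/n-ugra,an-ugra/"))
print(agrees("a/n-ugra,an-ugra/", "anugra/"))   # reference form is an assumption
```

With this convention a headword counts as consistent when any one of its printed variants matches, which is how the comma list can resolve an apparent difference.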

some rules

As the task progressed, I tried to develop rules to handle cases where the accent(s) in mw is not obvious, but requires some sort of inference. Sometimes, these rules are referenced in change_mw_3 (e.g. 50+ instances of Rule 1). The rules are:

  • Rule 1: only one accent per headword. Drop accent inherited from parent.
  • Rule 2: parent Xa/Ya Child (<s>am</s>) Xa/yam (i.e., inherited). Example uttaram
  • Rule 3: do not inherit accent in compound (similar to Rule 1)

Interested parties may wish to examine (in change_mw_3) instances of these rules.
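
Rule 1 lends itself to a simple mechanical check. The sketch below (hypothetical helper names) lists the variants of a k2 value that carry more than one accent mark, treating a comma-delimited k2 as separate variants; the sample forms are ones quoted in this thread.

```python
import re

def accent_count(form):
    """Number of udAtta ('/') and svarita ('^') marks in one k2 variant."""
    return len(re.findall(r"[/^]", form))

def multi_accent_variants(k2):
    """Variants of a (possibly comma-delimited) k2 that carry more than one accent."""
    return [v for v in k2.split(",") if accent_count(v) > 1]

for k2 in ["a/MSaBU/", "aMSaBU/", "a/n-ugra,an-ugra/", "sva/r\u2014ji/t"]:
    print(k2, multi_accent_variants(k2))
```

Anything it reports is a candidate for Rule 1 or Rule 3, subject to the print, since at least one printed exception was found.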

Thus far, I have found in the mw print only one exception to the 'only one accent per headword' rule: tAjadBaNga. I changed that to agree with pwg and noted it as
an 'mw print change' .

image

next step

samAsa correction in mw.

funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Oct 12, 2022
funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Oct 12, 2022
funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Oct 17, 2022
@funderburkjim
Contributor Author

a long road

The 'programmatic' mw accent corrections appear to me to be at an end. Further corrections require
manual review of mw.txt with the scans for all pages.
I've started this with pages 1-59.
Changes are in change_mw_6.txt.
The time required for these 59 pages was about 3 days, or 20 pages per day.

At this rate, the total cleanup remaining will require 2-3 months.

@Andhrabharati
Contributor

Andhrabharati commented Oct 17, 2022

I had 'sensed' this, much before starting the programmatic approach!!

If the latest iast file is made, I might be able to help in the next portion of the corrections. (after a few days probably)

I had also noticed some pc errors in the metalines, that could also be covered in the manual checking of HWs.

@funderburkjim
Contributor Author

@Andhrabharati Request you to do some random checking of the first batch of changes above, in case I need to make any mid-course corrections in method.

The main non-accent change in metalines that I've noticed is with the 'pc' value for the last item in
a column. Quite often, the pc for this item incorrectly refers to the next column, and thus requires correction.

I'm also not examining the VN entries, since I believe you have previously corrected these, and I found no required corrections in the first few VN.

@gasyoun
Member

gasyoun commented Oct 21, 2022

275 cases with udAtta accents classified as differing from PWG (out of about 5000 cases)

It will bring closer the day when the Reverse Dictionary might get published thanks to such cleanup rounds.

pc for this item incorrectly refers to the next column, and thus requires correction.

Interesting to note

total cleanup remaining will require 2-3 months.

Major Tom calling for @Andhrabharati ))

@Andhrabharati
Contributor

@funderburkjim appears to have decided to work it out himself!!

[I had asked him to make the IAST file to do it; but he instead chose to continue the process with slp1, and has opened a new (continuation) issue]

@Andhrabharati
Contributor

Andhrabharati commented Oct 21, 2022

And interestingly, seen that he is also filling up (some, if not all, of) the nom. case endings that I was talking about all these days for the past two years, that are missed/truncated in the current CDSL MW data!!

Probably, I might be able to do a full checkup once he finishes the process; though it takes his time, it definitely is a worthy spending at his end.

@Andhrabharati
Contributor

Probably @funderburkjim might close this issue, as another "continuation" issue is taken up now.

@funderburkjim
Contributor Author

nom. case endings

I am trying to do that mostly when it seems to give additional information for entries whose base form has an accent. One example is under uzRa/.

image

The (<s>as</s>) at the masculine form seems to give additional information: the masculine nom. singular is <s>uzRas</s> (we would write it with visarga, <s>uzRaH</s>, but that's beside the point).
This is instead of the possibly expected <s>uzRa/s</s> .
So, most of the nominative case endings added by me are like this.

Here is an example where I didn't add back the nom. singular form.
image

'as' here is the normal nominative singular ending for a masculine noun whose citation form ends in 'a'.
And MW seems to me to be inconsistent in inserting the 'as'. For instance, there is no 'as' in uzmaka.
image

There would be no objection from me if @Andhrabharati, in his later review of mw, decides to be more thorough in adding to mw.txt the nominative endings which remain missing in the digitization.
