MW accent correction #141

funderburkjim · 2022-09-19T18:11:20Z

In #140, it was mentioned that there are many errors in the coding of accents in the CDSL version of MW.
This issue is devoted to correcting these errors.

It is reasonable to restrict to headwords.
The 'k2' (key2) field in the metaline shows accents.

107802 matches for "<k2>.*/" in buffer: mw.txt  udAtta accents
114 matches for "<k2>.*\^" in buffer: mw.txt  svarita accents

In pwg,
470 matches for "<k2>.*\^" in buffer: pwg.txt
20809 matches for "<k2>.*/" in buffer: pwg.txt

In pw:
17929 matches for "<k2>.*/" in buffer: pw.txt
293 matches for "<k2>.*\^" in buffer: pw.txt
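
The counts above come from editor regex searches. A minimal Python sketch of the same tally follows, assuming each metaline starts with `<L>` and that the k2 value runs from `<k2>` to the next tag or the end of the line. The name `count_k2_accents` is made up for illustration, and its totals may differ slightly from the editor searches, since `<k2>.*/` matches a slash anywhere after `<k2>` on the line.

```python
import re

# k2 value: everything after <k2> up to the next tag or end of line (assumed layout).
K2_RE = re.compile(r"<k2>([^<]*)")

def count_k2_accents(lines):
    """Count metalines whose k2 carries an udAtta (/) or svarita (^) mark."""
    counts = {"udatta": 0, "svarita": 0}
    for line in lines:
        if not line.startswith("<L>"):
            continue  # body lines are skipped; only metalines carry k2
        m = K2_RE.search(line)
        if m is None:
            continue
        k2 = m.group(1)
        if "/" in k2:
            counts["udatta"] += 1
        if "^" in k2:
            counts["svarita"] += 1
    return counts

sample = [
    "<L>21088<pc>121,2<k1>asurya<k2>asurya^<h>1<e>2",
    "<L>212561<pc>1052,1<k1>Sapana<k2>Sa/pana<e>2",
    "<s>Sa/pana</s> n. a curse",  # body line, ignored
]
print(count_k2_accents(sample))
```

To run it on a real file, pass an open file handle, e.g. `count_k2_accents(open("mw.txt", encoding="utf-8"))`.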

We can assume there should be consistency in accent
between MW and the Boehtlingk dictionaries (PW, PWG).

A reasonable first step might be to look at the svarita accents.
For instance:

pw: <L>12716<pc>1151-1<k1>asurya<k2>asurya^<e>100
mw: <L>21088<pc>121,2<k1>asurya<k2>asurya^<h>1<e>2

We could do such a comparison by program and print out the exceptions
for hand examination.
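
A rough sketch of such a comparison, assuming metalines of the form shown above and matching entries by their k1 value. The helper names are hypothetical, and the sketch ignores hyphens or other separators that may appear in one dictionary's k2 but not the other's. The SapaTya metalines used as sample data are the ones quoted further down in this thread.

```python
import re
from collections import defaultdict

META_RE = re.compile(r"<k1>([^<]*)<k2>([^<]*)")

def k2_by_k1(lines):
    """Map each k1 to the set of k2 spellings found in metalines."""
    table = defaultdict(set)
    for line in lines:
        if line.startswith("<L>"):
            m = META_RE.search(line)
            if m:
                table[m.group(1)].add(m.group(2))
    return table

def accent_exceptions(mw_lines, other_lines):
    """Yield (k1, mw_k2s, other_k2s) for shared k1 whose k2 sets have nothing in common."""
    mw, other = k2_by_k1(mw_lines), k2_by_k1(other_lines)
    for k1 in sorted(mw.keys() & other.keys()):
        if mw[k1].isdisjoint(other[k1]):
            yield k1, sorted(mw[k1]), sorted(other[k1])

mw_sample = [
    "<L>21088<pc>121,2<k1>asurya<k2>asurya^<h>1<e>2",
    "<L>212560<pc>1052,1<k1>SapaTya<k2>SapaTya/<e>2",
]
other_sample = [
    "<L>12716<pc>1151-1<k1>asurya<k2>asurya^<e>100",
    "<L>97768<pc>7-0062<k1>SapaTya<k2>SapaTya^",
]
for row in accent_exceptions(mw_sample, other_sample):
    print("\t".join([row[0], ",".join(row[1]), ",".join(row[2])]))
```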

funderburkjim · 2022-09-19T18:16:30Z

See #137 (comment) for another approach to detecting accent problems.

drdhaval2785 · 2022-09-20T05:47:06Z

Kindly look at the following entry SapaTya in both PWG and MW

PWG

<L>97768<pc>7-0062<k1>SapaTya<k2>SapaTya^
{#SapaTya^#}¦ (wie eben) <lex>adj.</lex> {%auf Fluch beruhend%} 
<ls>ṚV. 10, 97, 16.</ls>
<LEND>

PWG display

शपथ्य [Printed book page [7-0062](https://www.sanskrit-lexicon.uni-koeln.de/scans/csl-apidev/servepdf.php?dict=PWG&page=7-0062)]
शपथ्य॑ (wie eben) adj. auf Fluch beruhend [Ṛv. 10, 97, 16.](https://sanskrit-lexicon.github.io/rvlinks/rvhymns/rv10.097.html#rv10.097.16)                  [ID=97768]

MW

<L>212560<pc>1052,1<k1>SapaTya<k2>SapaTya/<e>2
<s>SapaTya/</s> ¦ <lex>mfn.</lex> depending on a curse, (a sin) consisting in cursing or imprecation, <ls>RV.</ls><info lex="m:f:n"/>
<LEND>

MW display


(H2) [Printed book page [1052](https://www.sanskrit-lexicon.uni-koeln.de/scans/csl-apidev/servepdf.php?dict=MW&page=1052),1]
शपथ्य॑ mfn. depending on a curse, (a sin) consisting in cursing or imprecation, RV.  [ID=212560]

Note the PWG SapaTya^ versus MW SapaTya/
SapaTya ends with 'a' in svarita.
This sure is confusing. For dictionaries PWG and MW, we should have consistent SLP1 encoding.

drdhaval2785 · 2022-09-20T05:52:06Z

MW typo examples

MW data

<L>212560<pc>1052,1<k1>SapaTya<k2>SapaTya/<e>2
<s>SapaTya/</s> ¦ <lex>mfn.</lex> depending on a curse, (a sin) consisting in cursing or imprecation, <ls>RV.</ls><info lex="m:f:n"/>
<LEND>
<L>212561<pc>1052,1<k1>Sapana<k2>Sa/pana<e>2
<s>Sa/pana</s> ¦ <lex>n.</lex> a curse, imprecation, <ls>AV.</ls><info lex="n"/>
<LEND>

MW snippet

MW marked both with '/', whereas they are different.

PWG data

<L>97768<pc>7-0062<k1>SapaTya<k2>SapaTya^
{#SapaTya^#}¦ (wie eben) <lex>adj.</lex> {%auf Fluch beruhend%} 
<ls>ṚV. 10, 97, 16.</ls>
<LEND>

<L>97769<pc>7-0062<k1>Sapana<k2>Sa/pana
{#Sa/pana#}¦ (von {#Sap#}) <lex>n.</lex> = {#SapaTa#} 
<ls>AK. 1, 1, 5, 10.</ls> 
<ls>H. 262.</ls> {%Fluch%} 
<ls>TRIK. 3, 2, 9.</ls> 
<ls>AV. 1, 28, 3.</ls>
<LEND>

PWG snippet

PWG shows the two with different accents.

Andhrabharati · 2022-09-20T06:57:34Z

Good to see that @drdhaval2785 is also finding the 'inconsistency' issues across the dictionaries' data now.

A standing rule to be followed would be that, whatever the internal representation of the text (file) data is, the end result (display or otherwise) should tally with the printed matter, whether it is in the dictionary itself or in the reference work that it is citing from.

[I would have recommended Dhaval to post the citation matter from the RV and AV as well (as the case may be), to make the argument further strong/appealing. Ultimately they are the ones that are to be referred to, the dictionaries or others are just helping to reach them.]

drdhaval2785 · 2022-09-20T07:24:26Z

https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log.tsv - The highest priority accent differences
Entries are in headword AccentInMW AccentInPWG format

https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log.html - HTML can be downloaded and checked manually if needed.

drdhaval2785 · 2022-09-20T07:38:42Z

Once this is done, we can go to the next step.
That step concerns the differences arising from compounds.

TSV file - https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log_with_compounds.tsv
HTML file - https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log_with_compounds.html

examples

aMSaBU	a/MSaBU/	aMSaBU/

Here, the headword is a/MSa.
When it is used in a compound, because of the rules governing accent in compounds, it becomes aMSaBU/.
PWG correctly captures this.

As the compound parsing was done through some program in MW, the accent portion of it was not properly handled or could not be properly handled.
Therefore, it gave rise to a/MSaBU/ instead of aMSaBU/

We need to convert it back to aMSaBU/ as per PWG.
For headwords which occur in both MW and PWG, it may be an easy programmatic solution, but not sure how to handle the cases where there is no matching headword in PWG.

My hunch is that we should keep the accent as per the second part of the compound, and ignore the accent of first part of compound. Not sure.

Andhrabharati · 2022-09-20T07:49:56Z

My hunch is that we should keep the accent as per the second part of the compound, and ignore the accent of first part of compound. Not sure.

In general this is the principle to be followed, @drdhaval2785 ; in cases where the accent needs to be retained on the first part, the print has invariably mentioned it just before its (entry word's) lexical info (gender or otherwise).

I see that quite many of those portions are missing in the MW digitisation, though occasionally present scattered (just like the nom. case endings that I was talking about all these days).

Andhrabharati · 2022-09-20T07:57:39Z

See @gasyoun , Dhaval has now come out with two lists (499 + 3169) amounting to about 3600 entries, corroborating my estimate of more than a couple of thousand as posted at #140 (comment).

[I am quite sure there still would be more entries in the text, that need to be identified and corrected.]

drdhaval2785 · 2022-09-20T08:13:53Z

Slight correction.
499 is a subset of 3169, and not in addition thereto. So total 3169 differences.
Quite sizeable.

Andhrabharati · 2022-09-20T08:18:30Z

For headwords which occur in both MW and PWG, it may be an easy programmatic solution, but not sure how to handle the cases where there is no matching headword in PWG.

First option is to look for those in pwk, @drdhaval2785, and then in the pwkvn and also the "leftout" VN portions of PWG volumes 1-4 and 6. There are quite many in those pages, that did not come in PWG VN pages of Vol. 5 and Vol.7
(Jim was thinking the case to be otherwise with some random checks; I had checked all those entries and found that Jim was wrong, but did not pursue the matter with him! Much against my nature, to see the matter to reach its 'proper' end!!!)

drdhaval2785 · 2022-09-20T08:18:54Z

On a cursory look at https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log.tsv, it seems that a large chunk of it ends with ya. I am not sure whether there is some programmatic oddity which gave rise to this, or there is some grammatical rule which allows optional accents with words ending in suffixes ending with ya.
Just noting it here, so that some grammatically inclined person can have a look.

drdhaval2785 · 2022-09-20T08:20:50Z

First option is to look for those in pwk, @drdhaval2785 (and then in the pwkvn and also the "leftout" VN portions of PWG volumes 1-4.

Seems a reasonable way.
So hierarchy is PWG -> PWK -> PWKVN -> PWGVN
Pardon my ignorance about PWKVN and PWGVN. Have not seen them at all.
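
The lookup order above could be sketched as follows. This is only an illustration: the tables are toy dictionaries mapping k1 to k2, the function name `lookup_accent` is hypothetical, and how the VN data would actually be loaded is left open.

```python
def lookup_accent(k1, sources):
    """Return (source_name, k2) from the first source that lists k1, else None."""
    for name, table in sources:
        if k1 in table:
            return name, table[k1]
    return None

# Toy tables in the proposed priority order; real ones would be built from the metalines.
sources = [
    ("PWG", {"SapaTya": "SapaTya^"}),
    ("PWK", {"asurya": "asurya^"}),
    ("PWKVN", {}),
    ("PWGVN", {}),
]
print(lookup_accent("asurya", sources))
print(lookup_accent("notlisted", sources))
```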

Andhrabharati · 2022-09-20T08:26:48Z

Pardon my ignorance about PWKVN and PWGVN. Have not seen them at all.

These VN pages are the Additions and Corrections to pwk and PWG volumes [printed at the end of respective volume or in the later volume(s)].

pwkvn (of all the 7 volumes) is hosted under a separate repo, as Jim has his own reasons not to club it with the pwk text, as in the case of every other CDSL work. I had proposed to him once to combine it and then left the matter.

After my pointing out the matter being missed altogether, there were some trials to derive the pwkvn data from SCH data, but finally it was decided to completely get those pages retyped. Jim seems to have funded (James Funderburk > Fund; hope Jim does not mind my saying thus) the digitisation expenses, as per Thomas.

PWG VN portions of Vol. 5 and Vol. 7 are after the PWG main portions of those two volumes respectively.
The other volumes' VN data is lying in some old version of PWG, which came out in my 'dugging' the old folders, and I had even posted the data completely 'proofed' ; they just amount to some 1000+ entries/lines.

Andhrabharati · 2022-09-20T08:38:43Z

Slight correction. 499 is a subset of 3169, and not in addition thereto. So total 3169 differences. Quite sizeable.

I did not look at the items at all in the two lists, just seen the numbers.

Thought they would have been different, looking at this line--

Once this is done, we can go to the next step.

[I did not expect that the entries would have been repeated in another list, once 'done' in a list, as indicated in the above statement.]

vvasuki · 2022-09-20T11:13:36Z

Good to see that @drdhaval2785 is also finding the 'inconsistency' issues across the dictionaries' data now.

A standing rule to be followed would be that, whatever the internal representation of the text (file) data is, the end result (display or otherwise) should tally with the printed matter, whether it is in the dictionary itself or in the reference work that it is citing from.

But, it should not be the only "end result", or there would be no question of devanAgarI headwords for MW etc.. It is desirable, as mentioned elsewhere (sanskrit-lexicon/csl-ldev#7 (comment)) to additionally (and prominently) show the accent in a standardized format.

On a cursory look at https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log.tsv, it seems that a large chunk of it ends with ya. I am not sure whether there is some programmatic oddity which gave rise to this, or there is some grammatical rule which allows optional accents with words ending in suffixes ending with ya. Just noting it here, so that some grammatically inclined person can have a look.

@drdhaval2785 This is a matter of jAtya-svarita - or svarita arising from internal sandhi and not as a consequence of following an udAtta. In such words, instead of an udAtta setting the tone, you have a svarita. It occurs only after ya or va. For example, in case of shapathya, Bohtlingk-Sanskrit-Worterbuch-in-kurzerer-Fassung deciphers it as शपथि꣫अ . So, should be easy to detect programmatically.
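
One possible way to flag such pairs programmatically, as a sketch only: it assumes SLP1 k2 strings with the accent mark written after the vowel, and it looks only at a final `ya` or `va` with short `a`, which is a simplification (sva^r, for instance, would not be caught). The function name is hypothetical.

```python
import re

# Final syllable ya or va followed by an accent mark at the end of the k2 string.
FINAL_YV = re.compile(r"[yv]a([/^])$")

def is_jatya_svarita_mismatch(mw_k2, pwg_k2):
    """True if the two forms differ only in / versus ^ on a final ya or va."""
    m_mw, m_pwg = FINAL_YV.search(mw_k2), FINAL_YV.search(pwg_k2)
    if not (m_mw and m_pwg):
        return False
    same_stem = mw_k2[:-1] == pwg_k2[:-1]
    return same_stem and m_mw.group(1) != m_pwg.group(1)

print(is_jatya_svarita_mismatch("SapaTya/", "SapaTya^"))
print(is_jatya_svarita_mismatch("Sa/pana", "Sa/pana"))
```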

drdhaval2785 · 2022-09-20T11:18:47Z

Good point raised @vvasuki . Thanks.

vvasuki · 2022-09-20T11:45:46Z

Ultimately they are the ones that are to be referred to, the dictionaries or others are just helping to reach them.

An important point of clarification regarding the above. RV, SV, AV etc.. are NOT the only source of svara-s (as the bhAShyakAra says - it's impossible to list all sAdhu-shabda-s) - we have vyAkaraNa to deduce svara-s (which are incidentally a must in truly "proper" laukika speech as per shAstra). So, it is not inconceivable that the dict makers would have used the very same sUtras that we refer to today when determining svara-s in some cases.

Andhrabharati · 2022-09-20T13:44:56Z

For headwords which occur in both MW and PWG, it may be an easy programmatic solution, but not sure how to handle the cases where there is no matching headword in PWG.

First option is to look for those in pwk, @drdhaval2785, and then in the pwkvn and also the "leftout" VN portions of PWG volumes 1-4 and 6.

@funderburkjim
would you mind writing another "comparative display" program, to show MW | PWG | pwk + pwkvn (no need of having SCH in this case) in one screen, similar to https://sanskrit-lexicon.uni-koeln.de/scans/csl-apidev/pwkvn/03/?

[I was thinking of asking you this for many days now, but waiting for a suitable time.]

gasyoun · 2022-09-20T20:45:45Z

We can assume there should be consistency in accent between MW and the Boehtlingk dictionaries (PW, PWG).

Not always so, as there are ERRATA not yet implemented.

and then in the pwkvn and also the "leftout" VN portions of PWG volumes 1-4 and 6.

Addresses that errata portion.

My hunch is that we should keep the accent as per the second part of the compound, and ignore the accent of first part of compound. Not sure.

And it is stated by Dhaval, who did work on accent issues programmatically years ago.

I see that quite many of those portions are missing in the MW digitisation, though occasionally present scattered (just like the nom. case endings that I was talking about all these days).

A single sample?

[I am quite sure there still would be more entries in the text, that need to be identified and corrected.]

We do not care about the text yet, only headwords.

bhAShyakAra says - it's impossible to list all sAdhu-shabda-s

Can you trace the statement, please?

So, it is not inconceivable that the dict makers would have used the very same sUtras that we refer to today when determining svara-s in some cases.

They used only Vedic svaras, they did not determine ANYTHING.

vvasuki · 2022-09-21T01:15:12Z

So, it is not inconceivable that the dict makers would have used the very same sUtras that we refer to today when determining svara-s in some cases.

They used only Vedic svaras, they did not determine ANYTHING.

How are you so sure? Whitney deals with svara-s quite well in his grammar. They would be dumb to not have used simple rules which they would have doubtless encountered via sAyaNa's commentary and native informants.

My hunch is that we should keep the accent as per the second part of the compound, and ignore the accent of first part of compound. Not sure.

And it is stated by Dhaval, who did work on accent issues programmatically years ago.

Accent in compounds is not a "one rule for all" thing, if that's what you're talking about. Major rules are summarized here .

Best to show the accents of parts separately in case accent for the whole cannot be determined by lookup.

bhAShyakAra says - it's impossible to list all sAdhu-shabda-s

Can you trace the statement, please?

Vaguely recalled from paspashAhnika of mahAbhAShya, but could not find it exactly now. Just found this, which implies that - लघ्वर्थं चाध्येयं व्याकरणम् । 'ब्राह्मणेनावश्यं शब्दा ज्ञेयाः' इति, न चान्तरेण व्याकरणं लघुनोपायेन शब्दाः शक्या विज्ञातुम् ॥ . So, might be in kaiyaTa's comment.

vvasuki · 2022-09-21T01:44:20Z

Can you trace the statement, please?

Vaguely recalled from paspashAhnika of mahAbhAShya, but could not find it exactly now. Just found this, which implies that - लघ्वर्थं चाध्येयं व्याकरणम् । 'ब्राह्मणेनावश्यं शब्दा ज्ञेयाः' इति, न चान्तरेण व्याकरणं लघुनोपायेन शब्दाः शक्या विज्ञातुम् ॥ . So, might be in kaiyaTa's comment.

Found it -

अथैतस्मिञ् शब्दोपदेशे सति किं शब्दानां प्रतिपत्तौ प्रतिपदपाठः कर्तव्यः - गौरश्वः पुरुषो हस्ती शकुनिर् मृगो ब्राह्मण इत्येवमादयः शब्दाः पठितव्याः ?
नेत्याह । अनभ्युपाय एष शब्दानां प्रतिपत्तौ प्रतिपदपाठः ॥ एवं हि श्रूयते - 'बृहस्पतिर् इन्द्राय दिव्यं वर्षसहस्रं प्रतिपदोक्तानां शब्दानां शब्दपारायणं प्रोवाच नान्तं जगाम' ॥ बृहस्पतिश्च प्रवक्ता, इन्द्रश्चाध्येता, दिव्यं वर्षसहस्रमध्ययनकालः, न चान्तं जगाम । किं पुनरद्यत्वे ? यः सर्वथा चिरं जीवति - वर्षशतं जीवति । चतुर्भिश् च प्रकारैर् विद्योपयुक्ता भवति - आगम-कालेन, स्वाध्याय-कालेन, प्रवचन-कालेन, व्यवहार-कालेनेति । तत्र चास्यागमकालेनैवायुः पर्युपयुक्तं स्यात् । तस्माद् अनभ्युपायः शब्दानां प्रतिपत्तौ प्रतिपदपाठः॥+++(4)+++
कथं तर्हीमे शब्दाः प्रतिपत्तव्याः? किंचित् सामान्य-विशेषवल्-लक्षणं प्रवर्त्यम् । येनाल्पेन यत्नेन महतो महतः शब्दौघान् प्रतिपद्येरन् ॥ किं पुनस् तत् ? उत्सर्गापवादौ । कश्चिदुत्सर्गः कर्तव्यः, कश्चिदपवादः ॥

Andhrabharati · 2022-09-21T08:34:12Z

For headwords which occur in both MW and PWG, it may be an easy programmatic solution, but not sure how to handle the cases where there is no matching headword in PWG.

I just recalled why I said I have no great expectations in programmatic approach to @funderburkjim in the other (parent) issue.

There are quite some entries that were to be corrected for accents in both PWG and pwk (& pwkvn).

So even if the metalines' k2 entries are compared between MW and these, those VN forms still remain uncaught, for those are lying in the body portion still, and not carried into the HW portion yet.

It was with my intervention that this correction has happened in just MW (last year), from its annexure data.

@drdhaval2785 and @funderburkjim may think of getting some means to cover this point in a programmatic way.

[I have some other points at the back of my mind, and would post subsequently at some time.]

Andhrabharati · 2022-09-21T08:41:18Z

A single sample

I see no use of giving any example, @gasyoun !!

I don't have to prove here that I have looked into the print (pdf) and text (cdsl file) data close enough.

Andhrabharati · 2022-09-22T08:34:19Z

Just opened the two log files by @drdhaval2785 , and noticed that neither of them contain the 'aMhu' entry.

MW has it as two parts
<L>126<pc>1,2<k1>aMhu<k2>aMhu<e>2
<L>127<pc>1,2<k1>aMhu<k2>aMhu<e>2B

Incidentally the 2nd part <L>127 is to be with the acute accent as per MW, not the 1st part <L>126 (which is with udAtta accent in PWG and pwk)

Also for future ref. (if it ever happens!), note the accent difference in the cross-referred word, <s>paro/-'Mhu</s> in MW (<L>126 body portion) as against {#paroMhu#} in PWG VN (<L>62430 body portion); and CDSL MW text has this word marked with acute mark (as compared to the print having the grave mark; PWG suggests no accent at all).

Though MW99 has picked up much of its data from Boehtlingk's dictionaries, undoubtedly it did take help from many other sources and also has some independent work (I would estimate the ratio as 75:25 roughly, for the above two portions); thus, we cannot always take Boehtlingk as the ultimate authority!

PWG has-- <L>55<pc>1-0007<k1>aMhu<k2>aMhu/

pwk has-- <L>54<pc>1001-3<k1>aMhu<k2>aMhu/<e>100

Leaving the actual differences between MW and PWG/pwk accents (as above) aside, the main point is that, the 'logic' used in identifying the differences programmatically needs some 'refining'.
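
As an illustration of one possible refinement (not a description of what find_accent_diff.py actually does), the comparison could report cases where one side has no accent at all as their own class, so that an entry like aMhu is not silently dropped. The function names and the class labels below are made up for this sketch.

```python
def strip_accents(k2):
    """Remove the SLP1 accent marks / and ^."""
    return k2.replace("/", "").replace("^", "")

def classify(mw_k2, pwg_k2):
    """Label a pair of k2 values by the kind of accent difference."""
    if strip_accents(mw_k2) != strip_accents(pwg_k2):
        return "spelling differs"
    if mw_k2 == pwg_k2:
        return "same"
    if mw_k2 == strip_accents(mw_k2):
        return "mw unaccented"
    if pwg_k2 == strip_accents(pwg_k2):
        return "pwg unaccented"
    return "accent differs"

print(classify("aMhu", "aMhu/"))
print(classify("SapaTya/", "SapaTya^"))
```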

Andhrabharati · 2022-09-22T09:17:33Z

A single sample

I see no use of giving any example, @gasyoun !!

I don't have to prove here that I have looked into the print (pdf) and text (cdsl file) data close enough.

@gasyoun
Just not to leave your 'wish' unfulfilled, here is a case showing a correct accent and a wrong accent (wrt the print) in the (suggested) portions in CDSL text data (however neither of these got applied to the resp. HW!!)-

-----------------
@drdhaval2785 these are the (suggested) accents in the first part of the compound words in the MW print.

I was referring to all such cases, in my post above-- #141 (comment)

Andhrabharati · 2022-09-22T09:41:52Z

Another (minor) discrepancy of CDSL data wrt the print can be seen in the above snippet--
while the first word 116701 has the accent info preceding the lexical info, the second word 116702 has it following the lex. info!!

The print is consistently having the accent info before the lex. info all through its pages (there might be some cases in opposite, but they would surely be rare).

Andhrabharati · 2022-09-22T11:24:55Z

Slight correction.
499 is a subset of 3169, and not in addition thereto. So total 3169 differences.
Quite sizeable.

@drdhaval2785, @gasyoun

Do you recall that at one time (about a year back?) I was saying that my estimate was that ~2% of HWs would be needing correction, which you were thinking to be 'a bit too much' of estimation (with so much of work/cleaning done over so many years)?

Whether it is an error in spelling or in accent, after all it is an error; and keep in mind that MW wrote a special note about the accents in his introduction, citing their importance specifically in Sanskrit.

[I do not like to say this-- but as I am looking deep into the text, I am finding that more errors got introduced into the text, as against the cleaning part, esp. while tagging various entities; in summary, my feeling is that it was more of tagging that took place than cleaning the MW text.]

Andhrabharati · 2022-09-22T11:46:30Z

[I am quite sure there still would be more entries in the text, that need to be identified and corrected.]

We do not care about the text yet, only headwords.

Even I was talking about HWs only, @gasyoun [in my words, text is the typed matter (whether it is HWs part or the rest); when I mean the meaning(s) part, I would be specifically saying body portion]!!

Ref: sanskrit-lexicon/MWS#141

Ref: #141

funderburkjim · 2022-09-26T18:30:05Z

mw svarita corrections from pwg

This work was done in issue141 directory.

As mentioned above, there are many more metalines in pwg with svarita accents than in mw:

114 matches for "<k2>.*\^" in buffer: mw.txt  svarita accents
470 matches for "<k2>.*\^" in buffer: pwg.txt

The 114 in mw were compared to the printed mw, and a few corrections made. See change_mw_1.txt.
Then, for each of the 470 pwg entries, the corresponding mw entries were compared to printed mw,
and changes made. See change_mw_2.txt. There was also one typo corrected in pwg. The analysis uses variations of @drdhaval2785 's find_accent_diff.py .
After these changes to mw,

849 matches for "<k2>.*\^" in buffer: temp_mw_2.txt

svarita_mw_2.txt lists these metalines.

There are 81 additional cases (see ad2arev.txt) where pwg shows a svarita accent, but either

no corresponding MW headword is noted, OR
the corresponding MW headword has no accent. (and is not included in the 470)
- The significance of this difference between pwg and mw is unknown (to me).

There are numerous (about 125) cases where MW has, in addition to a svarita-accented form, also an unaccented form. For example namasya:

An iast version of the revised (temp_mw_2.txt) mw: mw_2_svarita_iast.zip

gasyoun · 2022-09-26T20:27:54Z

There are 81 additional case

Interesting indeed, so MW is not a pure copycat.

funderburkjim · 2022-09-26T21:06:31Z

possible next step: inheritance

71873 matches for "<k2>.*[\/^].*[-—]" in buffer: temp_mw_2.txt
For example, aMSa has an accent, and this accent is, in the CDSL coding, 'inherited' by
compounds of aMSa.

<L>10<pc>1,1<k1>aMSa<k2>a/MSa<e>1
 ...
<L>20<pc>1,1<k1>aMSakaraRa<k2>a/MSa—karaRa<e>3
<L>21<pc>1,1<k1>aMSakalpanA<k2>a/MSa—kalpanA<e>3
<L>22<pc>1,1<k1>aMSaprakalpanA<k2>a/MSa—prakalpanA<e>3
 etc.

I think this 'accent inheritance in compounds' principle of CDSL is likely wrong in general. For instance
<k2>a/MSa—karaRa should be changed to <k2>aMSa—karaRa (remove accent).

Should the principle be?
Always remove inherited accents in compounds unless MW specifically says to use them.
For example, the inherited accent 'sva^r' should be removed in svargiri and svarjit,
but retained in svarcakzas and svarcanas:

<L>259095<pc>1281,1<k1>svar<k2>sva^r<h>4<e>2
 ...
<L>259109<pc>1281,2<k1>svargiri<k2>sva/r—giri<e>3   to change to svar—giri
<L>259110<pc>1281,2<k1>svarcakzas<k2>sva^r—cakzas<e>3  ok
<L>259111<pc>1281,2<k1>svarcanas<k2>sva^r—canas<e>3   ok
<L>259112<pc>1281,2<k1>svarjit<k2>sva/r—ji/t<h>a<e>3    to change to svar—ji/t
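
A sketch of how the proposed principle might be applied to a k2 value, assuming the first compound member is everything before the first long dash. The function name and the `keep_first` flag are hypothetical; whether to keep the first member's accent would have to come from a check against the print, not from the program.

```python
SEP = "\u2014"  # the long dash that separates compound members in MW k2

def drop_inherited_accent(k2, keep_first=False):
    """Strip / and ^ from the first compound member unless keep_first is set."""
    if SEP not in k2 or keep_first:
        return k2
    first, rest = k2.split(SEP, 1)
    first = first.replace("/", "").replace("^", "")
    return first + SEP + rest

print(drop_inherited_accent("a/MSa" + SEP + "karaRa"))
print(drop_inherited_accent("sva/r" + SEP + "ji/t"))
print(drop_inherited_accent("sva^r" + SEP + "cakzas", keep_first=True))
```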

vvasuki · 2022-09-27T01:20:52Z

71873 matches for "<k2>.*[\/^].*[-—]" in buffer: temp_mw_2.txt For example, aMSa has an accent, and this accent is, in the CDSL coding, 'inherited' by compounds of aMSa.
<L>10<pc>1,1<k1>aMSa<k2>a/MSa<e>1
 ...
<L>20<pc>1,1<k1>aMSakaraRa<k2>a/MSa—karaRa<e>3
<L>21<pc>1,1<k1>aMSakalpanA<k2>a/MSa—kalpanA<e>3
<L>22<pc>1,1<k1>aMSaprakalpanA<k2>a/MSa—prakalpanA<e>3
 etc.
I think this 'accent inheritance in compounds' principle of CDSL is likely wrong in general. For instance <k2>a/MSa—karaRa should be changed to <k2>aMSa—karaRa (remove accent).

In all these particular cases, accent actually would lie in the second part of the compound. For bahuvrIhi compounds (and a few other exceptions), the first constituent's accent would be retained. This is not possible to determine programmatically. So, indeed, it is a good idea to remove accent from both parts <k2>aMSa—karaRa. However, for the convenience of those who care for accents, it would be a great idea to provide independent accents for both the constituents (by adding a string like (a/MSa + ka/raRa)), so that they can work out the final accent of the compound in their heads.

Of course, the accent of the first constituent is easily available, and that of the second part may or may not be determined without ambiguity by a further lookup (eg. both करण॑ and क॑रण exist). So, the accent of the second part can be shown only in unambiguous cases.
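
A small sketch of how such a string could be generated, under the assumption that a lookup has already produced the list of accented k2 forms recorded for the second constituent. The function name is hypothetical, and the two accented spellings of karaRa used below are illustrative SLP1 forms, not taken from the data files.

```python
def constituent_note(first_k2, second_k1, second_k2_forms):
    """Build a note like '(a/MSa + ka/raRa)'; show the second accent only if unambiguous."""
    accented = sorted({f for f in second_k2_forms if "/" in f or "^" in f})
    # Exactly one accented candidate: use it. Otherwise fall back to the bare k1.
    second = accented[0] if len(accented) == 1 else second_k1
    return f"({first_k2} + {second})"

print(constituent_note("a/MSa", "karaRa", ["ka/raRa"]))
print(constituent_note("a/MSa", "karaRa", ["ka/raRa", "karaRa/"]))
```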

Should the principle be? Always remove inherited accents in compounds unless MW specifically says to use them. For example, the inherited accent 'sva^r' should be removed in svargiri and svarjit, but retained in svarcakzas and svarcanas:

Sounds like a good idea!

Andhrabharati · 2022-09-27T04:23:18Z

There are 81 additional case

Interesting indeed, so MW is not a pure copycat.

See my post above at #141 (comment)

Though MW99 has picked up much of its data from Boehtlingk's dictionaries, undoubtedly it did take help from many other sources and also has some independent work (I would estimate the ratio as 75:25 roughly, for the above two portions); thus, we cannot always take Boehtlingk as the ultimate authority!

"and also has some independent work"

Andhrabharati · 2022-09-27T04:29:05Z

Should the principle be?
Always remove inherited accents in compounds unless MW specifically says to use them.
For example, the inherited accent 'sva^r' should be removed in svargiri and svarjit,
but retained in svarcakzas and svarcanas:

I had posted several messages above, on the same point--

#141 (comment)

Andhrabharati · 2022-09-27T04:34:42Z

There are numerous (about 125) cases where MW has, in addition to a svarita-accented form, also an unaccented form.

That is how it is!
The accent would change at different contexts, and also at different 'lexical' forms.
[Sometimes, even the same lexical form could be having different accents!]

gasyoun · 2022-09-27T20:15:40Z

it would be a great idea to provide independent accents for both the constituents (by adding a string like (a/MSa + ka/raRa)), so that they can work out the final accent of the compound in their heads.

@Andhrabharati that does not make much sense to me. To give something wrong, that one needs to recalculate in his head.

So, the accent of the second part can be shown only in unambiguous cases.

Programmatically?

75:25 roughly

Missed that one before.

vvasuki · 2022-09-28T01:14:40Z

it would be a great idea to provide independent accents for both the constituents (by adding a string like (a/MSa + ka/raRa)), so that they can work out the final accent of the compound in their heads.

@Andhrabharati that does not make much sense to me. To give something wrong, that one needs to recalculate in his head.

Saying Water (←H₂ + O₂) instead of ~~H₂O~~ water became wrong since when?

So, the accent of the second part can be shown only in unambiguous cases.

Programmatically?

Yes

75:25 roughly

Missed that one before.

Ref: sanskrit-lexicon/MWS#141

funderburkjim · 2022-10-02T22:59:48Z

Phase 2

The focus here is on the MW headwords whose 'k2' differs from PWG, where PWG has an udAtta accent, and where MW has non-samAsa entries.
(i.e., similar to prior phase, except here udAtta and prior phase was svarita).

The work is still in the issue141 directory. The mw change transactions are in
change_mw_3.txt (about 600 lines changed) . Details can be seen in the commit above.

Expand k2 syntax

There are headwords where two accented variants are presented.

<L>6230<pc>32,1<k1>anugra<k2>a/n-ugra,an-ugra/<e>1    <<< NOTE THE COMMA
<s>a/n-ugra</s> or <s>an-ugra/</s> ¦ <lex>mf(<s>A</s>)n.</lex> not harsh or violent, mild, gentle, <ls>RV.</ls> &c.<info lex="m:f#A:n"/><info or="6230,anugra"/>

It was convenient to extend the metaline convention to allow a comma-delimited list for k2. See the sections singleton_or_and changes and
temp_singleton_k2changes of change_mw_3. This resolved several of the udAtta accent differences with PWG.
At this point, there were 350+ mw entries to compare with pwg (see ad3_rev.txt). The mw print was examined by hand, and the CDSL k2 markup was classified as '+' (CDSL agrees with print; 200+ cases) or 'x' (CDSL k2 markup may disagree with print; 160+ cases).
Then changes were made for the 'x' cases, see temp_change_mw_3b.txt section of change_mw_3.txt. After all the changes, there remain about 275 cases with udAtta accents classified as differing from PWG (out of about 5000 cases). These are shown
in file ad3b_rev.txt.
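
For illustration, a comparison that respects the comma-delimited k2 could look like the sketch below. It assumes the PWG k2 carries no hyphens, so hyphens are removed from the MW variants before comparing; that assumption, the helper names, and the PWG form `anugra/` in the sample are hypothetical.

```python
def k2_variants(k2):
    """Split a comma-delimited k2 into its accented variants."""
    return [v for v in k2.split(",") if v]

def agrees_with(mw_k2, pwg_k2):
    """True if any MW variant, with hyphens removed, equals the PWG form."""
    return any(v.replace("-", "") == pwg_k2 for v in k2_variants(mw_k2))

print(k2_variants("a/n-ugra,an-ugra/"))
print(agrees_with("a/n-ugra,an-ugra/", "anugra/"))
```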

some rules

As the task progressed, I tried to develop rules to handle cases where the accent(s) in mw is not obvious, but requires some sort of inference. Sometimes, these rules are referenced in change_mw_3 (e.g. 50+ instances of Rule 1). The rules are:

Rule 1: only one accent per headword. Drop accent inherited from parent.
Rule 2: parent Xa/Ya Child (<s>am</s>) Xa/yam (i.e., inherited). Example uttaram
Rule 3: do not inherit accent in compound (similar to Rule 1)

Interested parties may wish to examine (in change_mw_3) instances of these rules.
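
Rule 1 lends itself to a mechanical check. A minimal sketch, assuming accents are the SLP1 marks `/` and `^` and that a comma-delimited k2 is checked variant by variant; the function name is made up for illustration.

```python
import re

ACCENT_RE = re.compile(r"[/^]")

def rule1_violations(k2):
    """Return the comma-separated variants of k2 that carry more than one accent mark."""
    return [v for v in k2.split(",") if len(ACCENT_RE.findall(v)) > 1]

print(rule1_violations("a/MSaBU/"))
print(rule1_violations("a/n-ugra,an-ugra/"))
```

Entries flagged this way would still need a look at the print, since the rule has at least one known exception.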

Thus far, I have found in mw print only one exception to the 'only one accent per headword' rule: tAjadBaNga. I changed that to agree with pwg and noted it as
an 'mw print change' .

next step

samAsa correction in mw.

Ref: sanskrit-lexicon/MWS#141

funderburkjim · 2022-10-17T16:35:46Z

a long road

The 'programmatic' mw accent corrections appear to me to be at an end. Further corrections require
manual review of mw.txt with the scans for all pages.
I've started this with pages 1-59.
Changes are in change_mw_6.txt.
The time required for these 59 pages was about 3 days, or 20 pages per day.

At this rate, the total cleanup remaining will require 2-3 months.

Andhrabharati · 2022-10-17T16:50:42Z

I had 'sensed' this, much before starting the programmatic approach!!

If the latest iast file is made, I might be able to help in the next portion of the corrections. (after a few days probably)

I had also noticed some pc errors in the metalines, that could also be covered in the manual checking of HWs.

funderburkjim · 2022-10-17T17:02:26Z

@Andhrabharati Request you to do some random checking of the first batch of changes above, in case I need to make any mid-course corrections in method.

The main non-accent change in metalines that I've noticed is with the 'pc' value for the last item in
a column. Quite often, the pc for this item incorrectly refers to the next column, and thus requires correction.

I'm also not examining the VN entries, since I believe you have previously corrected these, and I found no required corrections in the first few VN.

gasyoun · 2022-10-21T22:01:52Z

275 cases with udAtta accents classified as differing from PWG (out of about 5000 cases)

It will close the day when the Reverse Dictionary might get published thanks to such cleanup rounds.

pc for this item incorrectly refers to the next column, and thus requires correction.

Interesting to note

total cleanup remaining will require 2-3 months.

Major Tom calling for @Andhrabharati ))

Andhrabharati · 2022-10-21T23:24:33Z

@funderburkjim appears to have decided to work it out himself!!

[I had asked him to make the IAST file to do it; but he instead chose to continue the process with slp1, and has opened a new (continuation) issue]

Andhrabharati · 2022-10-21T23:32:52Z

And interestingly, seen that he is also filling up (some, if not all, of) the nom. case endings that I was talking about all these days for the past two years, that are missed/truncated in the current CDSL MW data!!

I might be able to do a full check once he finishes the process; though it takes his time, it is definitely time well spent on his part.

Andhrabharati · 2022-10-22T17:13:55Z

Probably @funderburkjim might close this issue, as another "continuation" issue is taken up now.

funderburkjim · 2022-10-24T03:16:06Z

> nom. case endings

I am trying to do that mostly when it seems to give additional information for entries whose base form has an accent. One example is under uzRa/.

The (<s>as</s>) at the masculine form seems to give additional information (e.g., the masculine nom. singular is <s>uzRas</s> (we would write with visarga <s>uzRaH</s> but that's beside the point).)
This is instead of the possibly expected <s>uzRa/s</s> .
So, most of the nominative case endings added by me are like this.
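The uzRa/ case can be spelled out in a few lines. This is only a sketch of the reasoning, under simplifying assumptions: the headword is a plain a-stem, the udAtta marker '/' follows the accented vowel as in the k2 coding, and both function names are hypothetical.

```python
UDATTA = "/"  # udAtta marker in the slp1 coding of k2, written after the accented vowel


def expected_nominative(k2, ending="as"):
    """Nominative one might expect for an a-stem headword.

    The ending replaces the final 'a'; if that 'a' carried the udAtta,
    the accent is kept on the vowel of the ending.
    """
    if k2.endswith("a" + UDATTA):
        return k2[:-2] + ending[0] + UDATTA + ending[1:]
    if k2.endswith("a"):
        return k2[:-1] + ending
    raise ValueError(f"not an a-stem: {k2}")


def printed_nominative(k2, printed_ending):
    """Nominative as printed: the stem plus the ending exactly as given in parentheses."""
    stem = k2.rstrip(UDATTA)[:-1]  # drop a trailing accent mark, then the final 'a'
    return stem + printed_ending


expected = expected_nominative("uzRa/")        # accent carried onto the ending
printed = printed_nominative("uzRa/", "as")    # ending as printed, without accent
print(expected, printed, expected != printed)
```

When the two forms differ, as they do for uzRa/, the printed ending carries information that the headword alone does not, which is the criterion described above for adding it back.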

Here is an example where I didn't add back the nom. singular form.

'as' here is the normal nominative singular ending for a masculine noun whose citation form ends in 'a'.
And MW seems to me to be inconsistent in inserting the 'as'. For instance, there is no 'as' in uzmaka.

I would have no objection if, in his later review of mw, @Andhrabharati decides to be more thorough in adding to mw.txt the nominative endings that remain missing in the digitization.

gasyoun added the cleanup label Sep 20, 2022

vvasuki mentioned this issue Sep 22, 2022

Bad headword पर्ॐहु sanskrit-lexicon/cologne-stardict#35

Closed

funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Sep 26, 2022

mw svarita accent corrections.

d237351

Ref: sanskrit-lexicon/MWS#141

funderburkjim added a commit that referenced this issue Sep 26, 2022

mw svarita accent corrections, using pwg.

722fce2

Ref: #141

funderburkjim added a commit to sanskrit-lexicon/csl-corrections that referenced this issue Oct 2, 2022

MW print change. Ref: sanskrit-lexicon/MWS#141

a91a4ee

funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Oct 2, 2022

MW: accent correction, phase 2.

5285044

Ref: sanskrit-lexicon/MWS#141

funderburkjim added a commit that referenced this issue Oct 2, 2022

Accent correction, Phase 2. #141

5cd9ac0

funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Oct 12, 2022

mw: phase 3 of accent corrections.

c49a461

Ref: sanskrit-lexicon/MWS#141

funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Oct 12, 2022

MW 2 additional corrections to previous commit.

70b7392

Ref: sanskrit-lexicon/MWS#141

funderburkjim added a commit that referenced this issue Oct 12, 2022

MW accent corrections, phase 3. #141

d9bd758

funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Oct 17, 2022

MW accent update pages 0001-0059.

3630a00

Ref: sanskrit-lexicon/MWS#141

funderburkjim added a commit that referenced this issue Oct 17, 2022

accent review pages 0001-0059. #141

075dafa

funderburkjim mentioned this issue Oct 21, 2022

MWS accent correction, continue phase 3 #142

Closed

funderburkjim closed this as completed Oct 24, 2022

