MW accent correction #141
Comments
See #137 (comment) for another approach to detecting accent problems. |
Kindly look at the following entry PWG
PWG display
MW
MW display
Note the PWG |
MW typo examples
MW data: [image: MW snippet]
MW marked both with '/', whereas they are different.
PWG data: [image: PWG snippet]
PWG shows them both to be different. |
Good to see that @drdhaval2785 is also finding the 'inconsistency' issues across the dictionaries' data now. A standing rule to follow would be that, whatever the internal representation of the text (file) data is, the end result (display or otherwise) should tally with the printed matter, whether it is in the dictionary itself or in the reference work that it is citing from. [I would have recommended Dhaval to post the cited matter from the RV and AV as well (as the case may be), to make the argument stronger. Ultimately they are the ones that are to be referred to; the dictionaries or others are just helping to reach them.] |
https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log.tsv - The highest priority accent differences https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log.html - HTML can be downloaded and checked manually if needed. |
Once this is done, we can go to the next step. TSV file - https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log_with_compounds.tsv examples
Here, the headword is As the compound parsing was done through some program in MW, the accent portion of it was not properly handled or could not be properly handled. We need to convert it back to My hunch is that we should keep the accent as per the second part of the compound, and ignore the accent of first part of compound. Not sure. |
In general this is the principle to be followed, @drdhaval2785 ; in cases where the accent needs to be retained on the first part, the print has invariably mentioned it just before its (entry word's) lexical info (gender or otherwise). I see that quite many of those portions are missing in the MW digitisation, though occasionally present scattered (just like the nom. case endings that I was talking about all these days). |
See @gasyoun, Dhaval has now come out with two lists (499 + 3169) amounting to about 3600 entries, corroborating my estimate of more than a couple of thousand as posted at #140 (comment). [I am quite sure there would still be more entries in the text that need to be identified and corrected.] |
Slight correction. |
First option is to look for those in pwk, @drdhaval2785, and then in the pwkvn and also the "leftout" VN portions of PWG volumes 1-4 and 6. There are quite many in those pages, that did not come in PWG VN pages of Vol. 5 and Vol.7 |
On a cursory look at https://github.com/sanskrit-lexicon/MWS/blob/master/accent_diff/log.tsv, it seems that a large chunk of it ends with |
Seems a reasonable way. |
These VN pages are the Additions and Corrections to pwk and PWG volumes [printed at the end of the respective volume or in the later volume(s)]. pwkvn (of all the 7 volumes) is hosted under a separate repo, as Jim has his own reasons not to merge it into the pwk text, as is done for every other CDSL work. I had proposed to him once to combine it and then left the matter. After I pointed out that this matter was being missed altogether, there were some trials to derive the pwkvn data from SCH data, but finally it was decided to get those pages completely retyped. Jim seems to have funded (James Funderburk > Fund; hope Jim does not mind my saying thus) the digitisation expenses, as per Thomas. The PWG VN portions of Vol. 5 and Vol. 7 come after the PWG main portions of those two volumes respectively. |
I did not look at the items at all in the two lists, just seen the numbers. Thought they would have been different, looking at this line--
[I did not expect that the entries would have been repeated in another list, once 'done' in a list, as indicated in the above statement.] |
But it should not be the only "end result", or there would be no question of devanAgarI headwords for MW etc. It is desirable, as mentioned elsewhere (sanskrit-lexicon/csl-ldev#7 (comment)), to additionally (and prominently) show the accent in a standardized format.
@drdhaval2785 This is a matter of jAtya-svarita, that is, svarita arising from internal sandhi and not as a consequence of following an udAtta. In such words, instead of an udAtta setting the tone, you have a svarita. It occurs only after ya or va. For example, in the case of shapathya, Bohtlingk-Sanskrit-Worterbuch-in-kurzerer-Fassung deciphers it as शपथि꣫अ . So, it should be easy to detect programmatically. |
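The "svarita only after ya or va" observation above could be turned into a quick filter. What follows is only a minimal sketch, not the method used in this repository. It assumes that k2 values are in SLP1 with '/' marking udAtta and '^' marking svarita, each written directly after the accented vowel; the function name and the sample words are hypothetical.

```python
import re

# Assumed accent coding (not confirmed in this thread): '/' = udAtta,
# '^' = svarita, each placed right after the accented vowel in SLP1.
UDATTA = "/"
SLP1_VOWELS = "aAiIuUfFxXeEoO"

# A svarita-marked vowel immediately preceded by y or v.
_SVARITA_AFTER_YV = re.compile(r"[yv][" + SLP1_VOWELS + r"]\^")


def is_jatya_svarita_candidate(k2: str) -> bool:
    """True if k2 has no udAtta and a svarita vowel right after y or v."""
    return UDATTA not in k2 and bool(_SVARITA_AFTER_YV.search(k2))


if __name__ == "__main__":
    # Hypothetical sample headwords, for illustration only.
    for hw in ["SapaTya^", "agni/", "rAja^n"]:
        print(hw, is_jatya_svarita_candidate(hw))
```

Headwords that carry '^' but fail this test would be the ones worth checking by hand against the print.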
Good point raised @vvasuki . Thanks. |
An important point of clarification regarding the above. RV, SV, AV etc. are NOT the only source of svara-s (as the bhAShyakAra says, it's impossible to list all sAdhu-shabda-s); we also have vyAkaraNa to deduce svara-s (which are incidentally a must in truly "proper" laukika speech as per shAstra). So, it is not inconceivable that the dictionary makers would have used the very same sUtras that we refer to today when determining svara-s in some cases. |
@funderburkjim [I was thinking of asking you this for many days now, but waiting for a suitable time.] |
As there are ERRATA not yet implemented, not always so.
Addresses that errata portion.
And it is stated by Dhaval, who worked on accent issues programmatically years ago.
A single sample?
We do not care about the text yet, only headwords.
Can you trace the statement, please?
They used only Vedic svaras; they did not determine anything themselves. |
How are you so sure? Whitney deals with svara-s quite well in his grammar. They would be dumb to not have used simple rules which they would have doubtless encountered via sAyaNa's commentary and native informants.
Accent in compounds is not a "one rule for all" thing, if that's what you're talking about. Major rules are summarized here . Best to show the accents of parts separately in case accent for the whole cannot be determined by lookup.
Vaguely recalled from paspashAhnika of mahAbhAShya, but could not find it exactly now. Just found this, which implies that - लघ्वर्थं चाध्येयं व्याकरणम् । 'ब्राह्मणेनावश्यं शब्दा ज्ञेयाः' इति, न चान्तरेण व्याकरणं लघुनोपायेन शब्दाः शक्या विज्ञातुम् ॥ . So, might be in kaiyaTa's comment. |
Found it - अथैतस्मिञ् शब्दोपदेशे सति किं शब्दानां प्रतिपत्तौ प्रतिपदपाठः कर्तव्यः - गौरश्वः पुरुषो हस्ती शकुनिर् मृगो ब्राह्मण इत्येवमादयः शब्दाः पठितव्याः ? |
I just recalled why I said I have no great expectations in programmatic approach to @funderburkjim in the other (parent) issue. There are quite some entries that were to be corrected for accents in both PWG and pwk (& pwkvn). So even if the metalines' k2 entries are compared between MW and these, those VN forms still remain uncaught, for those are lying in the body portion still, and not carried into the HW portion yet. It was with my intervention that this correction has happened in just MW (last year), from its annexure data. @drdhaval2785 and @funderburkjim may think of getting some means to cover this point in a programmatic way. [I have some other points at the back of my mind, and would post subsequently at some time.] |
I see no use of giving any example, @gasyoun !! I don't have to prove here that I have looked into the print (pdf) and text (cdsl file) data close enough. |
Just opened the two log files by @drdhaval2785 , and noticed that neither of them contain the 'aMhu' entry. MW has it as two parts Incidentally the 2nd part Also for future ref. (if it ever happens!), note the accent difference in the cross-referred word, Though MW99 has picked up much of its data from Boehtlingk's dictionaries, undoubtedly it did take help from many other sources and also has some independent work (I would estimate the ratio as 75:25 roughly, for the above two portions); thus, we cannot always take Boehtlingk as the ultimate authority! PWG has-- pwk has-- Leaving the actual differences between MW and PWG/pwk accents (as above) aside, the main point is that the 'logic' used in identifying the differences programmatically needs some 'refining'. |
@gasyoun
I was referring to all such cases, in my post above-- #141 (comment) |
Another (minor) discrepancy of CDSL data wrt the print can be seen in the above snippet-- The print is consistently having the accent info before the lex. info all through its pages (there might be some cases in opposite, but they would surely be rare). |
Do you recall that at one time (about a year back?) I was saying that my estimate was that ~2% of HWs would need correction, which you thought to be 'a bit too much' of an estimate (with so much work/cleaning done over so many years)? Whether it is an error in spelling or in accent, after all it is an error; and keep in mind that MW wrote a special note about the accents in his introduction, citing their importance specifically in Sanskrit. [I do not like to say this-- but as I am looking deep into the text, I am finding that more errors got introduced into the text than were cleaned, esp. while tagging various entities; in summary, my feeling is that it was more tagging than cleaning that took place in the MW text.] |
Even I was talking about HWs only, @gasyoun [in my words, text is the typed matter (whether it is HWs part or the rest); when I mean the meaning(s) part, I would be specifically saying body portion]!! |
mw svarita corrections from pwg
This work was done in the issue141 directory. As mentioned above, there are many more metalines in pwg with svarita accents than in mw:
The 114 in mw were compared to the printed mw, and a few corrections made. See change_mw_1.txt.
svarita_mw_2.txt lists these metalines. There are 81 additional cases (see ad2arev.txt) where pwg shows a svarita accent, but either
There are numerous (about 125) cases where MW has, in addition to a svarita-accented form, also an unaccented form. For example namasya: An iast version of the revised (temp_mw_2.txt) mw: mw_2_svarita_iast.zip |
Interesting indeed, so MW is not a pure copycat. |
possible next step: inheritance
I think this 'accent inheritance in compounds' principle of CDSL is likely wrong in general. For instance Should the principle be?
|
In all these particular cases, accent actually would lie in the second part of the compound. For bahuvrIhi compounds (and a few other exceptions), the first constituent's accent would be retained. This is not possible to determine programmatically. So, indeed, it is a good idea to remove accent from both parts Of course, the accent of the first constituent is easily available, and that of the second part may or may not be determined without ambiguity by a further lookup (eg. both करण॑ and क॑रण exist). So, the accent of the second part can be shown only in unambiguous cases.
Sounds like a good idea! |
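The "show the second member's accent only in unambiguous cases" idea above could be sketched as a simple lookup. This is an illustration under stated assumptions, not existing CDSL code: accents are assumed to be coded in SLP1 with '/', '^' and '\', and the function names and the tiny headword list are hypothetical.

```python
from collections import defaultdict

# Assumed SLP1 accent marks: '/' udAtta, '^' svarita, '\\' anudAtta.
ACCENT_MARKS = "/^\\"


def strip_accents(k2):
    """Return the headword with all accent marks removed."""
    return "".join(ch for ch in k2 if ch not in ACCENT_MARKS)


def build_index(accented_headwords):
    """Map each unaccented spelling to the set of accented forms seen."""
    index = defaultdict(set)
    for hw in accented_headwords:
        index[strip_accents(hw)].add(hw)
    return index


def accent_for_member(member, index):
    """Return the accented form only when exactly one is known."""
    forms = index.get(member, set())
    return next(iter(forms)) if len(forms) == 1 else None


if __name__ == "__main__":
    # Illustrative data: two accentuations of karaRa, one of agni.
    idx = build_index(["ka/raRa", "karaRa/", "agni/"])
    print(accent_for_member("karaRa", idx))  # ambiguous
    print(accent_for_member("agni", idx))    # unique
```

Returning `None` for both unknown and ambiguous members keeps the display conservative; a real tool would probably want to distinguish the two cases for review.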
See my post above at #141 (comment)
"and also has some independent work" |
I had posted several messages above, on the same point-- |
That is how it is! |
@Andhrabharati that does not make much sense to me: giving something wrong, which the reader then has to recalculate in his head.
Programmatically?
Missed that one before. |
Saying
Yes
|
Phase 2
The focus here is on the MW headwords whose 'k2' differs from PWG, where PWG has an udAtta accent, and where MW has non-samAsa entries. The work is still in the issue141 directory. The mw change transactions are in
Expand k2 syntax
There are headwords where two accented variants are presented.
It was convenient to extend the metaline convention to allow a comma-delimited list for k2. See the sections singleton_or_and
changes and some rules
As the task progressed, I tried to develop rules to handle cases where the accent(s) in mw is not obvious, but requires some sort of inference. Sometimes, these rules are referenced in change_mw_3 (e.g. 50+ instances of Rule 1). The rules are:
Interested parties may wish to examine (in change_mw_3) instances of these rules. Thus far, I have found in mw print only one exception to the
next step
samAsa correction in mw. |
a long road
The 'programmatic' mw accent corrections appear to me to be at an end. Further corrections require
At this rate, the total cleanup remaining will require 2-3 months. |
I had 'sensed' this, much before starting the programmatic approach!! If the latest iast file is made, I might be able to help in the next portion of the corrections. (after a few days probably) I had also noticed some pc errors in the metalines, that could also be covered in the manual checking of HWs. |
@Andhrabharati Request you to do some random checking of the first batch of changes above, in case I need to make any mid-course corrections in method. The main non-accent change in metalines that I've noticed is with the 'pc' value for the last item in I'm also not examining the VN entries, since I believe you have previously corrected these, and I found no required corrections in the first few VN. |
It will close the day when the Reverse Dictionary might get published thanks to such cleanup rounds.
Interesting to note
Major Tom calling for @Andhrabharati )) |
@funderburkjim appears to have decided to work it out himself!! [I had asked him to make the IAST file to do it; but he instead chose to continue the process with slp1, and has opened a new (continuation) issue] |
And interestingly, seen that he is also filling up (some, if not all, of) the nom. case endings that I was talking about all these days for the past two years, that are missed/truncated in the current CDSL MW data!! Probably, I might be able to do a full checkup once he finishes the process; though it takes his time, it definitely is a worthy spending at his end. |
Probably @funderburkjim might close this issue, as another "continuation" issue is taken up now. |
I am trying to do that mostly when it seems to give additional information for entries whose base form has an accent. One example is under uzRa/. The Here is an example where I didn't add back the nom. singular form. 'as' here is the normal nominative singular ending for a masculine noun whose citation form ends in 'a'. There would be no objection from me if, in his later review of mw, @Andhrabharati decides to be more thorough in adding to mw.txt the nominative endings which remain missing in the digitization. |
In #140, it was mentioned that there are many errors in the coding of accents in the CDSL version of MW.
This issue is devoted to correcting these errors.
It is reasonable to restrict to headwords.
The 'k2' (key2) field in the metaline shows accents.
We can assume there should be consistency in accent
between MW and the Boehtlingk dictionaries (PW, PWG).
A reasonable first step might be to look at the svarita accents.
For instance:
We could do such a comparison by program and print out the exceptions
for hand examination.
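A comparison of this kind might look like the sketch below. It is a rough illustration, not the program actually used for this issue. It assumes that each entry starts with a metaline of the form `<L>...<pc>...<k1>...<k2>...`, that svarita is coded as '^' in k2, and that entries are matched on k1; the function names and sample lines are invented.

```python
import re
from collections import defaultdict

# Assumed metaline layout: <L>id<pc>page<k1>key1<k2>key2[<h>n...]
_META = re.compile(r"<k1>([^<]*)<k2>([^<]*)")


def k2_by_k1(lines):
    """Collect the set of k2 spellings found for each k1 in the metalines."""
    table = defaultdict(set)
    for line in lines:
        if line.startswith("<L>"):
            m = _META.search(line)
            if m:
                table[m.group(1)].add(m.group(2))
    return table


def svarita_mismatches(mw_lines, pwg_lines):
    """List shared k1 whose k2 sets differ and involve a svarita ('^')."""
    mw, pwg = k2_by_k1(mw_lines), k2_by_k1(pwg_lines)
    out = []
    for k1 in sorted(mw.keys() & pwg.keys()):
        a, b = mw[k1], pwg[k1]
        if a != b and any("^" in k2 for k2 in a | b):
            out.append((k1, sorted(a), sorted(b)))
    return out


if __name__ == "__main__":
    # Invented sample lines, for illustration only.
    mw = ["<L>1<pc>1,1<k1>kanya<k2>kanya/", "body text", "<LEND>"]
    pwg = ["<L>9<pc>2-001<k1>kanya<k2>kanya^", "body text", "<LEND>"]
    for row in svarita_mismatches(mw, pwg):
        print(row)
```

Each reported tuple holds the k1 plus the MW and PWG k2 variants, which is enough to drive a hand check against the printed pages.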