MWS accent correction, continue, phase 4 #145
@Andhrabharati Here is the IAST version for your use. temp1_mw_extra_iast.zip Request: If you find the need to change the number of lines, please defer these.
You may want to review the two_accent.txt file of #142, or you may prefer to let @AnnaRybakovaT do this. LOOKING FORWARD TO WHAT YOU FIND! |
I can start this only after a couple of days. Meanwhile, can this update (so far done) be made public? I am sure no one would notice the difference, as has been the case all these years!! [This would be a silent (but worthy) improvement on the existing data.] |
The changes (of #142) are already public! I'll aim to install the latest user corrections to MW before you turn to accent review, will post a comment here when these user corrections have been made, and will revise your iast version accordingly. |
If @AnnaRybakovaT is to take up the two-accents file, she can easily finish the task with the 'tool' that I suggested at #142 (comment) |
The purpose of reviewing the two-accent file is to compare consistency of mw.txt (as shown in the two_accent.txt file) with
PW is not needed for this analysis. |
OK, this is a fairly simple task then. |
Just opened the file and seen that it contains 178 cases of two (or more) accents, and not 177 as mentioned by you. The line I used the same regex-- |
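For anyone who wants to cross-check a count like this without the regex, here is a minimal sketch. It assumes accents in the IAST file are the combining acute (U+0301) once the text is NFD-normalized, and it counts per line rather than per headword, so it may not match the rule behind two_accent.txt. The function names are hypothetical and not taken from the repository.

```python
import unicodedata

# Assumption: an accent is the combining acute (U+0301) after NFD
# normalization; add other marks here if the file uses them.
ACCENT_MARKS = {"\u0301"}


def count_accents(text, marks=ACCENT_MARKS):
    """Count accent marks in text after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return sum(1 for ch in decomposed if ch in marks)


def lines_with_multiple_accents(lines, minimum=2):
    """Return (line_number, line) pairs having at least `minimum` accents."""
    return [
        (n, line)
        for n, line in enumerate(lines, start=1)
        if count_accents(line) >= minimum
    ]
```

Normalizing first means precomposed and decomposed spellings of the same accented vowel are counted alike, which is one common reason two counts of "the same" file differ by one or two.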
@Andhrabharati do you believe it would ever make sense? |
Strictly speaking, YES; they should tally when the same word is being referred to. I would just bring up the point that BR have chosen to put accents on Devanagari text, and MW opted to put them on Roman text. It is a mistake to consider that they have different notations; it is just a script difference, not an absolute accent difference between them. If they are transcribed into any other script for comparison, they have to tally -- no second thoughts on this point. And the -ar and a few other endings that Böhtlingk had opted for could easily be taken care of, to be in sync with all others' works. |
Here are my remarks wrt your readme_extra file-- You may go through these once and take appropriate action. |
That would be of utmost interest, as I'm interested in an index - index to all the words from Sanskrit dictionaries, accents included, where known.
Yes, all such issues are noted. Not tens of them. |
As the two_accent file is very small, started looking at it and seen 100 entries so far (out of 178). Noted 44 |
Also noted a few cases where the accent is added when not in print. Possibly, there might be contra-cases where the accent in print got missed in the text. This makes me think of (and decide on) reading the full HWs once, instead of just the entries with accent marks (in the text), in my next perusal. |
I see that the CDSL search has agnī-varuṇau only, as against the agnī́-váruṇau in the iast file I got from you. What is causing this difference? |
Finished looking at the 178 entries identified above. And the file with my remarks for your necessary action is hereunder, @funderburkjim -- [I am sure that you can identify the differing entries very easily from this file; as such I did not mark them separately.]
|
@Andhrabharati I have passed the baton to you. Thus I will leave these corrections to you to do. |
OK; then, things will take place sometime later at my end. I haven't touched the iast file yet, except for random browsing. |
Thanks! |
just a small query @funderburkjim , |
It is better that you make the corrections mentioned in readme_extra.AB.txt, and two_accents_iast.AB.txt, since you fully understand exactly what needs to be done. If Anna had reviewed two_accents, probably you and she would have worked together, and in the end you would have included the corrections in your iast file. Once you've made corrections to iast file and made the revised file available, my task will include such steps as:
|
I guess these, and any more such if found in my perusal, would have to be made into additional entries, with appropriate tagging. |
Hurray.
Would love to hear more details in 2023. |
I'm not sure how to get your 460. My first thought was that by 'comma between' you were talking about commas in the 'k2' field of metalines, But then I realized that you were likely talking about cases that could be considered as alternate headwords which have been 'missed' -- and any such might very well require new entries. I thought you were going to focus on corrections to accent markup. If these multiple headwords are top of mind for you now, we can deal with them first AND IN A SEPARATE ISSUE. It will help me think with you on this if you provide a file of the 460. Or, consideration of these 460 can be done after your accent correction -- your choice. |
multiple accent patterns
identified by Comma in k2 due to multiple accent types
This takes care of 189+15 = 214 of your 460 cases.
In other words, these 160 account for 160 of your 460. Partially similar with the 'B' cases (and 2 'C' cases). |
starting point for Andhrabharati
I've now finished user corrections for mw. temp_mw_01_iast.zip is ready for you. |
Just 36 corrections! Would you be willing to consider my working, if I do some (kind of) major changes? I will go step-by-step, so that you can 'follow' the process without much effort. As I have to look at every page and entry, I think it is the best chance to make use of this opportunity to read the text fully wrt the print, and incorporate necessary changes/corrections in the mw text. I have some good reasons, to have decided on taking up this path. |
The 'additional' work (from what you listed above) I foresee at your end is mostly to write (or re-run, as I am sure you would've already written those earlier) some small programs, to correlate my working with CDSL text. |
I think @Andhrabharati has constantly shown his interest in doing major overhauls in one go. I think the way we can do this is the following.
Does this make sense to all concerned? Iff this goes through, we would be able to take maximum advantage of @Andhrabharati’s potential. |
Quite happy to see you coming in, @drdhaval2785 !! You have come up with a good proposal and I would like to say that we two can do the working on MW, and involve Jim at a later (final ?) stage, so that he can be on other major works-- PWG, pwk and I am shortly going to offer him a similar work (biblio-related) on Vacaspatyam(!!). |
Observation-1a: There are two lines (575366 & 575369) having two broken vertical bars; there should be only one per line.
to be changed as
I would be looking for trivial (and non-intrusive) errors as well (like this) in my working. |
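The check in Observation-1a (more than one broken vertical bar on a line) can be sketched roughly as below. This is only an illustration: the function name is hypothetical, and it flags lines with more than one bar rather than lines with none, since metalines and other non-body lines are assumed to carry no bar at all.

```python
BROKEN_BAR = "\u00a6"  # the broken vertical bar character


def lines_with_extra_bars(lines, limit=1):
    """Return 1-based numbers of lines with more than `limit` broken bars."""
    return [
        n
        for n, line in enumerate(lines, start=1)
        if line.count(BROKEN_BAR) > limit
    ]
```

Run over the lines of the iast file, the returned numbers can be compared directly with the line numbers quoted above.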
I am OK with the suggestion put forward by you. I may not be able to chip in on daily basis, but maybe weekly basis. |
Well understood, @drdhaval2785 ! Even I am not looking for any daily works. |
And I would like to close this 'accent' related issue with the corrections based on my above posted two files, and start a new issue "Thorough review of MW text", as my proposal is far beyond just accents. |
Observation-1b: Observation-1c: A cursory look at any page of MW clearly shows that every HW entry has a comma separating it from the word-ending (if given) or the gender info, before the meaning etc. is started. This is the notation adopted by MW. So, we need to insert the comma in almost all those 214K cases. Observation-1d: This finishes my study on the body marker
Now I wait for the response from @funderburkjim and @drdhaval2785 ; to know if they agree in how I am going to work with the MW text, and are willing to incorporate all such corrections in the CDSL file. If they feel that I am doing some irrelevant and uncalled for (extra) work, I don't have to proceed this way; but will try to limit myself to what they suggest, so that my time and effort are spent in useful manner. |
The organizational ideas above by Andhrabharati and Dhaval seem constructive. I would like to be kept in the loop in the beginning. Once the method stabilizes, I would suggest each 'step' (or small set of steps) have
Let @Andhrabharati do a first step to get this multiple-step process started! |
Good to hear this, @funderburkjim ! Hopefully, by next Christmas we'd have MW text brought closer to the print with some value added markings. Merry Christmas to you, and thanks for spending time to correct the accents portion to a major extent; I do not know if anyone earlier had 'bothered' about this important point, but I sure did. |
I believe we are ready for the yearly Skype call. How about 5th of January? We had it 12 am NY time @funderburkjim?
Of much interest to listen to the scope over Skype.
Yeah a year of weekly loops sounds reasonable.
As far as I'm aware it was never considered an issue. For me it's important because now I will be able to add accents to my index of all known Sanskrit words. |
On a 2nd thought, I think it is better to close this issue here itself as is (as these points would anyway be covered in the wholesome reading), and start a new issue for a full 'reviewing'. And I have separated out the trailing 'info' tags, and also removed the slp1 texts throughout (under 's1' tags: 53100 and 'ab n=' tags: 2540). This facilitates a free (and unobtrusive) reading of the file. Here are the two files, that I have made from the temp_mw_01_iast.txt file-- I hope this is acceptable. |
Do you have Python on your computer? I |
two useful programs
Two programs added to issue145: diff_to_changes_dict.py and updateByLine.py. These help to analyze what AB did. |
Yes, I have it. |
I just used regex process, no programming. |
file naming convention
It is awkward to have space-characters in file names.
significance of 'temp' filenames
The .gitignore file in this repository has a statement 'temp*', which means that files whose names start with 'temp' are not tracked by git. There is some art in deciding what should be temp and what should not be temp. |
temp_mw_01_iast.txt is the original iast version (for this repository). Both files have the same number of lines. Hurray!
get file of changes
This is where the diff_to_changes.py file is useful. It is applicable since the two files have the same number of lines.
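Because the two files line up one-to-one, a list of changes can be derived by walking them in parallel. The following is a minimal sketch of that idea only; it is not the code of the program in the issue145 folder, and the function name and the (line number, old, new) tuple format are assumptions made for illustration.

```python
def line_changes(old_lines, new_lines):
    """Return (line_number, old, new) for every line that differs.

    Only meaningful when both files have the same number of lines,
    so that line n of one file corresponds to line n of the other.
    """
    if len(old_lines) != len(new_lines):
        raise ValueError("files must have the same number of lines")
    return [
        (n, old, new)
        for n, (old, new) in enumerate(zip(old_lines, new_lines), start=1)
        if old != new
    ]
```

Keeping the old text alongside the new text makes each change reviewable on its own, without opening either file.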
updateByLine.py
updateByLine.py constructs a new file from an old file and a change file.
Output to terminal is
Now we can compare temp_mw_02_iast.txt and temp.txt, using diff utility (part of git bash terminal)
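The idea of rebuilding a file from an old file plus a list of line changes, and then confirming the result matches the target, can be sketched as follows. This is a hedged illustration, not the actual updateByLine.py: the function name and the (line number, old, new) change format are assumptions.

```python
def apply_changes(old_lines, changes):
    """Return a new list of lines with each (line_number, old, new) applied.

    Raises ValueError if the current text of a line does not match the
    'old' text recorded in the change, which guards against applying a
    change file to the wrong version of the file.
    """
    new_lines = list(old_lines)
    for lnum, old, new in changes:
        if new_lines[lnum - 1] != old:
            raise ValueError(f"line {lnum}: old text does not match")
        new_lines[lnum - 1] = new
    return new_lines
```

If the rebuilt lines equal the lines of the revised file, the change list fully accounts for the differences, which is the same thing an empty diff between the two files shows.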
|
Wrong issue!
Just noticed I should have put these comments in sanskrit-lexicon/mw-dev#2! I've copied issue145 folder to issue146. Agree that we can close this issue145 now. |
@gasyoun That date/time is fine with me. |
Right? @Andhrabharati , @drdhaval2785 , @SergeA ? |
I will not be able to join on that day. Saturday or Sunday after 4 pm IST would be suitable for me. |
Saturday better than Sunday. What about |
As I continued marking the MW data in my intended way [if not used at CDSL, useful for someone else in future; most probably at our own site, wherein we did not pursue updating the Skt. Dictionaries for past 6 years!!], (accidentally) came across Though these two are not marked as a OR-group, they do exist as different entries. Just wanted to bring this info to Jim's notice, whether he agrees to make all such 'eligible' words to be separate entries or not.
And I am seeing that too many (running into couple of thousands) HWs-- apart from the above mentioned 460 cases, are 'missed' that could be made as separate entries; some as grouped (OR and AND) and some of other type. |
Hard for me, as many Sanskrit classes are there. Evening Sunday is bad for Dhaval. Saturday at best I have an hour in between, like 7 or 8th of January.
I am your fan. |
@gasyoun choose a day and time for your side of the world. I'll probably be able to join the meeting. |
@gasyoun AFAIK, as of this moment, NO meeting day/time is set. |
missed alt headwords
I think of there being several 'kinds' of alternate headwords identifiable in MW
I see no good reason to adhere to the MW printed text in these last two cases, Better to follow model of I am sure there is currently inconsistency in mw.txt especially in the last two ('see' and 'list of works') cases. |
Further review of accents in MWS., based on the version of MW at #142;
Namely, version of mw.txt in sanskrit-lexicon/csl-orig repository at v02/mw/mw.txt at commit 360db2b.