BUR Study #37

Open
Andhrabharati opened this issue Jan 30, 2022 · 43 comments

Comments

@Andhrabharati

Andhrabharati commented Jan 30, 2022

@drdhaval2785, @funderburkjim

Just thought of filling up the Greek strings in BUR, and had a quick look at the file & book contents.

  1. There are almost 15000 <P> entries which either do not "appear" in the CDSL online searching as of now, or are part of the prev. <L> entry, though present in the text file. Seems most of these (if not all) have to be "promoted" to <L> status, being alternate HWs or derived HWs etc. wrt the prev. entry.
    I would suggest marking them all with <L>xxx.n numbering, as separate entries.

  2. There are two good lists of Anubandhas (5pp.) & Dhatus (15pp.) in the book, after the p.759 (where the text file ended), which could also be digitized and added to the search.
    I do not know if this has already been done and is lying somewhere "inaccessible". (I could not see them even in bur_orig.txt.)

@Andhrabharati
Author

Andhrabharati commented Jan 30, 2022

  1. There are almost 9900 <lbinfo> tags, which are used to mark hyphenated words at the line-crossovers. But as the book "lines" are NOT maintained in the text file, these have no use at all and could be simply removed.
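
A minimal sketch of the removal suggested above, assuming the tags are self-closing and look like `<lbinfo n="7"/>`. The exact form used in bur.txt is not shown in this thread, and the name `strip_lbinfo` is hypothetical:

```python
import re

# Assumption: each tag is a single element such as <lbinfo n="7"/>.
# The pattern also tolerates a missing slash or extra attributes.
LBINFO_RE = re.compile(r'<lbinfo\b[^>]*/?>')


def strip_lbinfo(line):
    """Return the line with every <lbinfo ...> tag removed."""
    return LBINFO_RE.sub('', line)
```

Running this over each line of the file and diffing against the original would show whether anything other than the tags changed.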

@Andhrabharati
Author

  1. The punctuation marks (. , ; : ?), before the %} mostly need to be kept after the %}.
  2. The {%(...)%} and {%[...]%} markings to be changed to ({%...%}) and [{%...%}].
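
Both changes could be sketched with regular expressions roughly as follows. This is only an illustration: the function names are hypothetical, and since the first point says "mostly", a blanket substitution like this would still need its output reviewed (for instance, a dot that belongs to an abbreviation inside the italics).

```python
import re

# A single punctuation mark sitting immediately before the closing %} marker.
TRAILING_PUNCT_RE = re.compile(r'([.,;:?])%\}')
# Italic spans that are wholly wrapped in one pair of () or [].
PAREN_RE = re.compile(r'\{%\(([^%()]*)\)%\}')
BRACKET_RE = re.compile(r'\{%\[([^%\[\]]*)\]%\}')


def move_punctuation(text):
    """Move a punctuation mark from just inside %} to just outside it."""
    return TRAILING_PUNCT_RE.sub(r'%}\1', text)


def move_brackets(text):
    """Rewrite {%(...)%} as ({%...%}) and {%[...]%} as [{%...%}]."""
    text = PAREN_RE.sub(r'({%\1%})', text)
    return BRACKET_RE.sub(r'[{%\1%}]', text)
```

Spans that contain nested or unbalanced brackets are deliberately left untouched by `move_brackets`, so they can be handled by hand.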

@gasyoun
Member

gasyoun commented Jan 30, 2022

could also be digitized and added to the search.

Here is my scan: https://vk.com/samskrtamru?w=wall-88831040_13310

racines

as the book "lines" are NOT maintained in the text file

Sounds like a pity.

The punctuation marks (. , ; : ?), before the %} mostly need to be kept after the %}.

I guess this is something you can do yourself, @Andhrabharati, with a GitHub pull request?

@Andhrabharati
Author

The punctuation marks (. , ; : ?), before the %} mostly need to be kept after the %}.

I guess this is something you can do yourself, @Andhrabharati, with a GitHub pull request?

Just fyi, I have been doing all such stuff myself, and (unfortunately!) I do much more than what cologne team can 'accept' (when I feel it leads to a better 'presentation' of the text, I leave no stone unturned).

@funderburkjim

@Andhrabharati In the current Cologne system, a given dictionary xxx exists in three related forms:

  • xxx.txt, as it is in csl-orig/v02/xxx/xxx.txt
  • xxx.xml This is created from xxx.txt, by means of the make_xml.py function in csl-pywork repository
  • html - This is created from xxx.xml at run-time for a particular headword. The html is created by php programs in the csl-websanlexicon (or csl-apidev) directory.

There is also a stardict dictionary form created in https://github.com/sanskrit-lexicon/cologne-stardict repository (which Dhaval maintains completely).

So, if you create a 'better' form of some xxx.txt, then that may be incompatible with the make_xml.py or with the php display code. I think you find this incompatibility frustrating.

On the other hand, many changes can be made to xxx.txt that ARE compatible with the Cologne system.

In regard to your particular suggestions re Burnouf dictionary, I suggest you fill in the Greek text in csl-orig/v02/bur/bur.txt. This is a kind of change which should cause no compatibility problems.

Once this is done, let's discuss further the idea of 'promoting' the <P> subheadwords to full status.

@funderburkjim

@Andhrabharati Just realized you will likely be starting with bur.txt as it exists in this
csl-devanagari repository.

@drdhaval2785 Suppose AB adds greek text to this devanagari version of bur.txt.
Do you have a script that generates the slp1 version from the devanagari version?
And if so, have you checked invertibility?

@drdhaval2785
Contributor

@funderburkjim

Invertibility is taken care of.
See the script redo.sh

echo "Convert to Devanagari."
mkdir -p ../v02/$1
python3 to_devanagari.py $1
echo "Convert back to SLP1."
python3 to_slp1.py $1
echo "Store differences in ../diff/$1.txt."
diff ../slp1/$1.txt ../../csl-orig/v02/$1/$1.txt > ../diff/$1.txt
echo "Complete."

  1. Convert to Devanagari
  2. Convert back to SLP1 and store in SLP1 folder (untracked by github, as it is supposed to be identical with csl-orig data).
  3. Compare the data in SLP1 folder with csl-orig data.
  4. If there is any difference, store it in diff folder.

Once the script is run, manually see that the diff folder holds all files with 0 bytes i.e. there is no difference. This way invertibility is ensured.
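
The manual check could also be sketched in Python. This is only an illustration and not part of the csl-devanagari scripts; the function name `nonempty_diffs` is hypothetical, and the `../diff` location in the usage note is taken from the script above.

```python
from pathlib import Path


def nonempty_diffs(diff_dir):
    """Return the sorted names of .txt files in diff_dir that are not 0 bytes."""
    return sorted(
        path.name
        for path in Path(diff_dir).glob('*.txt')
        if path.stat().st_size > 0
    )
```

Calling `nonempty_diffs('../diff')` after redo.sh has been run for each dictionary should return an empty list if the round trip is invertible; any name it returns points at a dictionary whose diff needs a look.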

When a change is made in csl-devanagari files

See carry_changes_to_cslorig.sh

dicts=(wil yat gst ben mw72 lan cae md mw shs ap90 mwe bor ae bur stc pwg gra pw ccs sch bop armh vcp skd inm vei pui bhs acc krm ieg snp pe pgn mci)
echo "STARTED TAKING CORRECTIONS FROM CSL-DEVANAGARI TO CSL-ORIG";
for dict in ${dicts[@]};
do
	echo $dict
	python3 to_slp1.py $dict
	cp ../slp1/$dict.txt ../../csl-orig/v02/$dict/$dict.txt
	echo "";
done

  1. Convert the changes to SLP1 transliteration and store in SLP1 folder of csl-devanagari repository.
  2. Copy the changes from SLP1 folder to csl-orig folder.
  3. See the git diff in csl-orig folder to ensure that it is as per corrections made in csl-devanagari, if need arises.
  4. add, commit and push changes in csl-orig repository.

Hope this takes care of your concerns about invertibility, Jim.

@funderburkjim

Thanks for docs. Looks eminently usable. Will give it a trial run if AB uploads a version of devanagari bur.txt with Greek text.

@drdhaval2785
Contributor

Dear @Andhrabharati Please update the csl-devanagari repository and use the latest file.
It would minimize the differences.

@Andhrabharati
Author

latest file? is this repo being updated?

@Andhrabharati
Author

I think you find this incompatibility frustrating.

Not at all, I keep on doing what I feel better; it's just that CDSL is 'not willing' to 'accept' to undertake the changes, if they seem different to the 'style' adopted there-- having no scope for 'real improvements'.

@Andhrabharati
Author

Andhrabharati commented Jan 31, 2022

Dear @Andhrabharati Please update the csl-devanagari repository and use the latest file.
It would minimize the differences.

I could as well just use the latest (SLP1) file from csl-orig itself, if it is just filling the Greek stuff.

But that's too little a portion of the work; I point to my recent INM work in this context, wherein I did quite some changes, apart from filling the Greek stuff all in one go. (of course, it did not attract the FULL attention of Jim.)

@Andhrabharati
Author

Andhrabharati commented Jan 31, 2022

could also be digitized and added to the search.

Here is my scan: https://vk.com/samskrtamru?w=wall-88831040_13310

These pages are even at csldoc, as 'Dictionary front matter'; a misnomer for these particular pages!!

@drdhaval2785
Contributor

After many years of association with CDSL, I would like to paraphrase your viewpoint so that it correctly reflects the status of collective wisdom at CDSL.

CDSL is 'not willing' to adopt major changes which do not allow programmatic conversion between the current version and the suggested version.

@Andhrabharati
Author

Andhrabharati commented Jan 31, 2022

I do understand the point well, @drdhaval2785.

What I fail to understand is-- while programs are being modified or even developed for small changes, why the same is NOT being done for major changes. It's just beyond my comprehension!

Anyway, let's not spend more time on this, but continue the efforts in bringing the texts to "correct form" first and fill the gaps (if any).

("Presentation" can be taken up by someone sometime, if it deserves!)

@gasyoun
Member

gasyoun commented Jan 31, 2022

having no scope for 'real improvements'.

It it is not in book - we can't accept such and improvement. Even if we like it.

These pages are even at csldoc

My scan quality is higher.

why the same is NOT being done for major changes. It's just beyond my comprehension!

Are you ready to code it? Jim is busy with things only he can do. We do not have enough coders on board.

bringing the texts to "correct form" first and fill the gaps (if any).

Exactly, thanks.

@Andhrabharati
Author

Andhrabharati commented Jan 31, 2022

"If" it is not in book - we can't accept such "an" improvement.

I can show innumerable instances contradicting this, that are already present in the CDSL texts!
(But I do not want to drag the issue any further.)

My scan quality is higher.

Yes, noticed this. How many such others do you have?

Are you ready to code it?

Yes, I can; but I won't (at least for time-being)!

@funderburkjim

I could as well just use the latest (SLP1) file from csl-orig itself, if it is just filling the Greek stuff.

Yes, that is so.

@Andhrabharati
Author

Andhrabharati commented Jan 31, 2022

I am already halfway through my file, with many more changes already done.

And I presumed giving just the ref. line (<L> number or the <P> string whichever is applicable) and the greek strings in it would ease CDSL work.

@funderburkjim

From my perspective, the best form would be a copy of bur.txt with all the Greek text filled in.

As a second choice, a file of changes to the lines of bur.txt. For example,
the first Greek text appears on line 19 of bur.txt, so a file 'bur-change.txt' would have:

19 old <lang n="greek"></lang>; <ab>lat.</ab> {%in;%} <ab>germ.</ab> {%un.%}
19 new <lang n="greek">GREEK TEXT</lang>; <ab>lat.</ab> {%in;%} <ab>germ.</ab> {%un.%}

and a similar pair of 'old/new' lines for each of the other 667 lines with greek text.

As a third choice, a file of the lines changed. For example, the first Greek text appears on line 19 of bur.txt, so a file 'bur-greek.txt' would have as its first line

19 <lang n="greek">GREEK TEXT</lang>; <ab>lat.</ab> {%in;%} <ab>germ.</ab> {%un.%}

and similarly for the other 667 lines with greek text.
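
A rough sketch of how a change file in the second form might be applied, assuming each change line is `N old TEXT` or `N new TEXT`, with N the 1-indexed line number in bur.txt and lines held without their trailing newline. The function name `apply_changes` is hypothetical, and this is not the actual Cologne tooling.

```python
import re

# One change line: line number, the word 'old' or 'new', then the full line text.
CHANGE_RE = re.compile(r'^(\d+) (old|new) (.*)$')


def apply_changes(lines, changes):
    """Apply 'N old TEXT' / 'N new TEXT' pairs and return the updated lines.

    The 'old' text must equal the current content of line N, which guards
    against applying the change file to the wrong version of the dictionary.
    """
    result = list(lines)
    checked = set()  # line numbers whose 'old' text has been verified
    for change in changes:
        match = CHANGE_RE.match(change)
        if match is None:
            continue  # blank lines and anything not in the expected form
        lnum = int(match.group(1))
        kind, text = match.group(2), match.group(3)
        if kind == 'old':
            if result[lnum - 1] != text:
                raise ValueError(f'line {lnum}: old text does not match the file')
            checked.add(lnum)
        else:
            if lnum not in checked:
                raise ValueError(f'line {lnum}: new text without a matching old line')
            result[lnum - 1] = text
            checked.discard(lnum)
    return result
```

The third form (only the new lines) is shorter to write, but it drops the 'old' check, so a shifted line number would go unnoticed.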

@Andhrabharati
Author

Andhrabharati commented Jan 31, 2022

My file has no line breaks now; all entries are in a single line.

But, I prefer making the second form (but slightly different)-
19 old <lang n="greek"></lang>;
19 new <lang n="greek">GREEK TEXT</lang>;
[limiting only to the Greek portion and the resp. ending punctuation].

And a few of them would be followed by ; comment lines.

Would this suit you?

@funderburkjim

What about the few lines where there is more than one <lang n="greek"></lang> ?
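
Such lines could be listed with a few lines of Python. This is a sketch under the assumption that unfilled Greek is always written exactly as the empty tag quoted above; the name `lines_with_multiple_greek` is hypothetical.

```python
# The empty placeholder exactly as it is quoted in this thread.
EMPTY_GREEK = '<lang n="greek"></lang>'


def lines_with_multiple_greek(lines):
    """Return (line number, count) for lines holding more than one empty Greek tag."""
    found = []
    for lnum, line in enumerate(lines, start=1):
        count = line.count(EMPTY_GREEK)
        if count > 1:
            found.append((lnum, count))
    return found
```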

@Andhrabharati
Author

Andhrabharati commented Jan 31, 2022

They would all be in the resp. line, unless a comment line mentions some merger (if any); otherwise all diff. strings would be present individually.

@funderburkjim

Likely I can reliably convert your form to my second form.

@Andhrabharati
Author

If you are interested, I can give the full etym. lines (all languages) as well, as many had undergone changes, like tagging or correcting.

But probably sticking to Greek alone in the first step is preferable.

@funderburkjim

sticking to Greek alone in the first step

Agree

@Andhrabharati
Author

  1. There are some places where the <L> entry itself contains a few other <L> candidates.

    For example, <L>4388 ({%kāyastha%} <ab>m.</ab>) has -- {%kāyasthā%} <ab>f.</ab> and -- {%kāyasthī%} <ab>f.</ab> inside.

@Andhrabharati
Author

The front pages matter (p.3) clearly mentioned the points 1 and 6.

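[6] La barre horizontale -- sépare les mots dans un même article.
... ...
[1] Après un mot principal écrit en dêvanâgari, nous rangeons ceux de ses dérivés et de ses composés qui se trouveraient placés immédiatement après lui dans l'ordre alphabétique. Les autres dérivés ou composés, que cet ordre écarterait du voisinage immédiat du mot principal, sont rangés à leur place naturelle. De sorte que l'ordre alphabétique est partout suivi.

[In English: [6] The horizontal bar -- separates the words within a single article. ... [1] After a main word written in Devanagari, we place those of its derivatives and compounds that would fall immediately after it in alphabetical order. The other derivatives or compounds, which that order would move away from the immediate vicinity of the main word, are placed in their natural position. Thus alphabetical order is followed throughout.]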

This indicates that making a digital text of all the dictionaries' "Front matters" (with Google OCR) and probably translating it into English (with DeepL) would help in understanding the dictionaries well and in planning the work on them properly.

Any takers for this simple task from your 'new team', @gasyoun?

@gasyoun
Member

gasyoun commented Jan 31, 2022

I can show innumerable instances contradicting this, that are already present in the CDSL texts!

Indeed there are. I guess it would be a good idea to document them as we know them.

How many such others do you have?

Not sure, not all volumes required, but will show in 2022 what I have.

I can give the full etym. lines

Would love to see them myself.

Any takers for this simple task from your 'new team', @gasyoun?

Can you document the steps for them to be done, please? One by one.

@Andhrabharati
Author

Any takers for this simple task from your 'new team', @gasyoun?

Can you document the steps for them to be done, please? One by one.

Hope @drdhaval2785 or @funderburkjim would be willing to give the steps.

@Andhrabharati
Author

Andhrabharati commented Feb 1, 2022

Just recalled that you also worked with Abbyy OCR, @gasyoun.

So probably you yourself could get the first step done, by explaining to the team.

Once a quick proofing for obvious errors in the OCRed text is done, translation (as and when required) could be taken up.

@funderburkjim

@Andhrabharati Is your main point regarding Burnouf Front matter to make an English Translation of the front matter?

@Andhrabharati
Author

Andhrabharati commented Feb 2, 2022

@Andhrabharati Regarding Burnouf Front matter. Are you aware of
https://www.sanskrit-lexicon.uni-koeln.de/scans/csldev/csldoc/build/dictionaries/prefaces/burpref.html ?

Yes, I am. In fact, I had already commented previously that even the "end matter" is lying here under the header of "front matter"!

But these are just the images, and I am talking about searchable digital text.

@Andhrabharati
Author

@Andhrabharati Is your main point regarding Burnouf Front matter to make an English Translation of the front matter?

not really, my main intention is to have a digital text first.

of course, having english text suits some people-- but there would be many people who might like to have the native language text as is.

@gasyoun
Member

gasyoun commented Feb 2, 2022

Just recalled that you also worked with Abbyy OCR, @gasyoun.

Yes, since 2002.

https://www.youtube.com/watch?v=oXH65ISgZRo and https://www.youtube.com/c/MarcisGasuns/search?query=abbyy

of course, having english text suits some people-- but there would be many people who might like to have the native language text as is.

Agree

@Andhrabharati
Author

Andhrabharati commented Feb 6, 2022

@funderburkjim

I finished filling in the Greek strings in BUR a few days back and have just been waiting for you to be free from the MBh. linking task.

Here are the lines (wrt the csl-orig file) as we discussed above; I hope you won't face many issues in using this data.

BUR greek string lines (csl-org) filled.txt

I would just like to suggest that you handle the ; commented ones first.

@Andhrabharati
Author

Andhrabharati commented Feb 6, 2022

There are NO ls candidates in BUR, but quite many abbr candidates are there.

Here is the list that covers most of them.
BUR abbr. list.txt
[The count is more than double the existing CDSL file markings.]

And here are the language abbr. items that could be tagged first, and expanded.
BUR language tags.txt

@Andhrabharati
Author

Andhrabharati commented Feb 6, 2022

As I doubt you would be interested in making any further changes, I am not posting my full observations, but only giving some global corrections (in case you would like to apply them) below:
BUR corrections.txt

@Andhrabharati
Author

One final comment before I move on to some other work-

There are quite many grouped entries in this work as well (marked with et, au, ',' or otherwise), and these could be handled as done in MW.
[I had earlier suggested doing the same in few other works also, but nothing has happened in that front so far.]

@funderburkjim

@Andhrabharati From first look at your greek text lines, the form should be readily useable. Will let you know when this is incorporated into bur.txt.

@Andhrabharati
Author

Here is my full BUR file, for whatever use/worth it has to the cdsl team--
bur (AB ver.) -v2.txt

@gasyoun
Member

gasyoun commented May 11, 2022

full BUR file

@funderburkjim did you have a chance to take a look at it since then?
