adding gujarati vocabulry dec 4 #1811

Closed
wants to merge 7 commits into from
14 changes: 14 additions & 0 deletions doctr/datasets/vocabs.py
@@ -22,6 +22,13 @@
"hindi_letters": "अआइईउऊऋॠऌॡएऐओऔअंअःकखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसह",
"hindi_digits": "०१२३४५६७८९",
"hindi_punctuation": "।,?!:्ॐ॰॥॰",
"gujarati_vowels": "અઆઇઈઉઊઋએઐઓઔઅંઅઃ ",
"gujarati_digits":"૦૧૨૩૪૫૬૭૮૯",
"gujarati_diacritics_consonants":"""કકાકિકીકુકૂકૃકેકૈકોકૌકંકઃખખાખિખીખુખૂખૃખેખૈખોખૌખંખઃગગાગિગીગુગૂગૃગેગૈગોગૌગંગઃઘઘાઘિઘીઘુઘૂઘૃઘેઘૈઘોઘૌઘંઘઃઙઙાઙિઙીઙુઙૂઙૃઙેઙૈઙોઙૌઙંઙઃચચાચિચીચુચૂચૃચેચૈચોચૌચંચઃછછાછિછીછુછૂછૃછેછૈછોછૌછંછઃ
Contributor

@sarjil77 I tested your added vocab a bit, and it raised some issues because we can't encode it. If I understood correctly, a leading letter combined with a dependent sign (the ones displayed with a dotted circle, for example: કૌ) renders as one character but is counted programmatically as 2 characters. Is there any way to make these strings Unicode-conformant?

In the end, each character in an image should correspond to exactly 1 encoded character.

If I filter your diacritics, I get the following:

ઃકખગઘઙચછજઝઞટઠડઢણતથદધનપફબભમયરલવશષાિીુૂૃેૈોૌ્
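
A filtered list like the one above can be obtained by reducing a vocab string to its unique code points. This is a minimal sketch, assuming a short sample string in place of the full `gujarati_diacritics_consonants` entry; the helper name `unique_code_points` is hypothetical.

```python
def unique_code_points(text: str) -> str:
    """Return the distinct code points of `text`, sorted by code point."""
    return "".join(sorted(set(text)))


# Sample only: KA, KA + sign AA, KA + sign I, KA + sign AU (not the full vocab entry).
sample = "\u0a95\u0a95\u0abe\u0a95\u0abf\u0a95\u0acc"
filtered = unique_code_points(sample)

# The sample has 7 code points but only 4 distinct ones.
print(len(sample), len(filtered))
```

The base letter and each vowel sign survive as separate entries, which is why the diacritics appear on their own in the filtered string.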

Btw, in a multiline string each line needs to end with \, otherwise the line break is counted as a character.
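
To illustrate the line-break point: inside a triple-quoted Python string, a raw line break becomes a `\n` character in the value, while a trailing backslash joins the two lines. A small sketch with placeholder contents:

```python
# Without a trailing backslash the line break becomes part of the string.
with_break = """કખ
ગઘ"""

# A trailing backslash continues the literal on the next line.
without_break = """કખ\
ગઘ"""

print(len(with_break), "\n" in with_break)
print(len(without_break), "\n" in without_break)
```

Concatenating several single-line string literals would avoid the problem as well.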

Contributor

@sarjil77 Something like this:

"gujarati_letters": "તગખઢરજયશઆઐઊૂેપફુ્ઓૈાથીડૃદઠવનલષકિઅભઘઉઔઝઙઇઞઈધૌછટચબોમએણઋ",
"gujarati_digits":"૦૧૨૩૪૫૬૭૮૯",
"gujarati_punctuation": "૰ઽ◌ંઃ॥ૐ" + "૱",

length: 103
all chars: તગખઢરજયશઆઐઊૂેપફુ્ઓૈાથીડૃદઠવનલષકિઅભઘઉઔઝઙઇઞઈધૌછટચબોમએણઋ૦૧૨૩૪૫૬૭૮૯૰ઽ◌ંઃ॥ૐ૱!"#$%&'()*+,-./:;<=>?@[]^_`{|}~

? Not sure anyway 😅

This is what I get if I deduplicate it in Python.

The single diacritics (which are additions to a base character) are counted as standalone symbols.
(Screenshot from 2024-12-04 10-49-46)
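
To see why the diacritics show up as standalone symbols, each code point of a cluster can be listed with its Unicode name and general category using the standard-library `unicodedata` module. A sketch on the example કૌ from earlier in the thread:

```python
import unicodedata

cluster = "\u0a95\u0acc"  # KA followed by the vowel sign AU, rendered as one glyph

for ch in cluster:
    # Categories starting with "M" are combining marks attached to a base letter.
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

# Python's len() counts code points, not rendered glyphs.
print(len(cluster))
```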

Contributor Author

Thanks @felixdittrich92, noted. I am not sure right now, but I will look further into this. :)

Contributor Author

Hello @felixdittrich92, you are right: a sequence like "ફુ્" (a consonant with diacritics) is counted as several characters and takes 9 bytes in UTF-8. To treat a character together with its diacritics as a single character, we could try NFC (Normalization Form C), which composes a base character and its combining marks into a single code point where Unicode defines a precomposed form, and leaves the text otherwise unchanged.

For example:
import unicodedata

txt = "ફુ્"

# UTF-8 bytes of the original string
encoded_string = txt.encode()

# NFC-normalized form of the same string
normalized_text = unicodedata.normalize("NFC", txt)

print(f"encoded string is: {encoded_string}")
print(f"the length of encoded string is: {len(encoded_string)}")
print(f"normalized_text is: {normalized_text}")
print(f"the length of normalized text is: {len(normalized_text)}")

output:
encoded string is: b'\xe0\xaa\xab\xe0\xab\x81\xe0\xab\x8d'
the length of encoded string is: 9
normalized_text is: ફુ્
the length of normalized text is: 3

Please have a look at this. I do not know how other people have added diacritics. We could also add just the consonants and vowels here, but that would not make much sense.

Let me know your thoughts.
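
One thing worth checking in the snippet above is that it compares a byte count (9) with a code point count (3). Unicode does not appear to define precomposed code points for Gujarati consonant plus vowel-sign sequences, so NFC should leave the code point count unchanged. The sketch below checks that and, as an assumption about what might be wanted instead, groups each base character with the combining marks that follow it. `split_clusters` is a hypothetical helper and only a rough approximation of Unicode grapheme clusters (it does not join conjuncts such as જ્ઞ).

```python
import unicodedata


def split_clusters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it."""
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)
    return clusters


txt = "\u0aab\u0ac1\u0acd"  # PHA + vowel sign U + virama, as in the example above
nfc = unicodedata.normalize("NFC", txt)

print(len(txt), len(nfc))        # code points before and after NFC
print(len(txt.encode("utf-8")))  # UTF-8 bytes
print(len(split_clusters(txt)))  # approximate user-perceived characters
```

If one label per rendered cluster is really required, the vocab would need to hold multi-code-point entries rather than a flat string of characters, which is a separate design question for the encoder.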

જજાજિજીજુજુજૃજેજૈજોજૌજંજઃઝઝાઝિઝીઝુઝૂઝૃઝેઝૈઝોઝૌઝંઝઃઞઞાઞિઞીઞુઞૂઞૃઞેઞૈઞોઞૌઞંઞઃટટાટિટીટુટૂટૃટેટૈટોટૌટંટઃઠઠાઠિઠીઠુઠૂઠૃઠેઠૈઠોઠૌઠંઠઃડડાડિડીડુડૂડૃડેડૈડોડૌડંડઃઢઢાઢિઢીઢુઢૂઢૃઢેઢૈઢોઢૌઢંઢઃણણાણિણીણુણૂણૃણેણૈણોણૌણંણઃતતાતિતીતુતૂતૃતેતૈતોતૌતંતઃથથાથિથીથુથૂથૃથીથૈથોથૌથંથઃ
દદાદિદીદુદૂદૃદેદૈદોદૌદંદઃધધાધિધીધુધૂધૃધેધૈધોધૌધંધઃનનાનિનીનુનૂનૃનેનૈનોનૌનંનઃપપાપિપીપુપૂપૃપેપૈપોપૌપંપઃફફાફિફીફુફૂફૃફેફૈફોફૌફંફઃબબાબિબીબુબૂબૃબેબૈબોબૌબંબઃભભાભિભીભુભૂભૃભેભૈભોભૌભંભઃમમામિમીમુમૂમૃમેમામોમાયમંમઃયયાયિયીયુયુયૃયેયૈયોયૌયંયઃરરારિરીરૂરૃરેરૈરોરૌરંરઃ
લલાલિલીલુલૂલૃલેલૈલોલૌલંલઃવવાવિવીવિવૂવૃવેવૈવોવૈવંવઃશશાશિશીશુશૂશૃશેશૈશોશૌશંશઃષષાષિષીષુષૂષૃષેષૈષોષૌષંષઃજ્ઞજ્ઞાજ્ઞિજ્ઞીજ્ઞુજ્ઞૂજ્ઞૃજ્ઞેજ્ઞૈજ્ઞોજ્ઞૌજ્ઞંજ્ઞઃ""",
"gujarati_punctuation": "૰◌્◌઼ઽ◌ઁ◌ંઃ॥ૐ" + "૱",
"bangla_letters": "অআইঈউঊঋএঐওঔকখগঘঙচছজঝঞটঠডঢণতথদধনপফবভমযরলশষসহ়ঽািীুূৃেৈোৌ্ৎংঃঁ",
"bangla_digits": "০১২৩৪৫৬৭৮৯",
"generic_cyrillic_letters": "абвгдежзийклмнопрстуфхцчшщьюяАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЬЮЯ",
@@ -58,6 +65,13 @@
)
VOCABS["hebrew"] = VOCABS["english"] + "אבגדהוזחטיכלמנסעפצקרשת" + "₪"
VOCABS["hindi"] = VOCABS["hindi_letters"] + VOCABS["hindi_digits"] + VOCABS["hindi_punctuation"]
VOCABS['gujarati'] = (
VOCABS['gujarati_diacritics_consonants']
+ VOCABS['gujarati_vowels']
+ VOCABS['gujarati_digits']
+ VOCABS['gujarati_punctuation']
+ VOCABS['punctuation']
)
VOCABS["bangla"] = VOCABS["bangla_letters"] + VOCABS["bangla_digits"]
VOCABS["ukrainian"] = (
VOCABS["generic_cyrillic_letters"] + VOCABS["digits"] + VOCABS["punctuation"] + VOCABS["currency"] + "ґіїєҐІЇЄ₴"
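
Before a combined entry such as `VOCABS['gujarati']` is built, a quick sanity check can catch repeated code points and stray line breaks. This is a minimal sketch on a stand-in dictionary; `check_vocab` and the sample entries are hypothetical and not part of docTR.

```python
from collections import Counter


def check_vocab(vocab: str) -> list[str]:
    """Return human-readable problems found in a vocab string."""
    problems = []
    if "\n" in vocab:
        problems.append("contains a line break")
    duplicates = [ch for ch, n in Counter(vocab).items() if n > 1]
    if duplicates:
        problems.append(f"repeated code points: {''.join(duplicates)}")
    return problems


# Stand-in entries, not the real docTR vocabularies.
sample_vocabs = {
    "letters": "\u0a95\u0a96\u0a97",
    "digits": "\u0ae6\u0ae7\u0ae8",
}
combined = sample_vocabs["letters"] + sample_vocabs["digits"]

print(check_vocab(combined))
print(check_vocab(combined + "\u0a95\n"))
```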