adding gujarati vocabulry dec 4 #1811
Closed
4b03ffa
adding gujarati vocabulry dec 4
bhavya-work 90707c0
updated gujarati vocab dec 6
f88a316
updated docs and gujarati langugae vocab
394661f
[Feat] Add torch.compile support (#1791)
felixdittrich92 bc1837e
[Bug] Fix vocabs and add corresponding test case (#1813)
felixdittrich92 d57ea5f
build(deps): bump JamesIves/github-pages-deploy-action (#1816)
dependabot[bot] 8fba03c
updates dataset.srt
@sarjil77 I tested a bit with your added vocab, and it raised some issues because we can't encode it. If I understood correctly, a leading letter combined with a dotted-circle diacritic (for example: કૌ) is rendered as one character, but programmatically it's counted as 2 characters. Is there any way to make these strings Unicode-conformant?
In the end, each character in an image should correspond to exactly 1 encoded character.
If I filter your diacritics I get the following:
Btw, with multiline strings each line of the string needs to end with
\
otherwise the newline is counted as a line break character.
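The backslash remark can be checked with a small sketch. The strings below are hypothetical placeholders, not the vocab from this PR; they only show how Python treats a triple-quoted string with and without a trailing backslash.

```python
# Hypothetical fragments for illustration only, not the real vocab.
with_backslash = """abc\
def"""
without_backslash = """abc
def"""

# The backslash-newline pair is dropped from the string.
print(len(with_backslash), repr(with_backslash))
# Without the backslash, the newline stays in the string as a character.
print(len(without_backslash), repr(without_backslash))
```

If the newline stays in the vocab string, it would presumably be treated as one more vocab symbol.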
@sarjil77 Something like this:
length: 103
all chars: તગખઢરજયશઆઐઊૂેપફુ્ઓૈાથીડૃદઠવનલષકિઅભઘઉઔઝઙઇઞઈધૌછટચબોમએણઋ૦૧૨૩૪૫૬૭૮૯૰ઽ◌ંઃ॥ૐ૱!"#$%&'()*+,-./:;<=>?@[]^_`{|}~
? Not sure anyway 😅
This is what I get if I deduplicate it in Python:
the single diacritics (as additions to a char) are counted as standalone symbols.
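A rough sketch of why deduplication yields standalone diacritics, assuming the deduplication works per code point (the exact method used above is not shown in the thread). The sample string is a short hypothetical one, not the full vocab.

```python
import unicodedata

# Hypothetical sample (not the PR vocab): the three syllables કૌ, કા and ફુ્.
sample = "\u0a95\u0acc\u0a95\u0abe\u0aab\u0ac1\u0acd"

# Deduplicate per code point while keeping first-seen order.
unique_chars = "".join(dict.fromkeys(sample))

# Code points whose general category starts with "M" are combining marks.
marks = [ch for ch in unique_chars if unicodedata.category(ch).startswith("M")]

print(len(sample), len(unique_chars), len(marks))
for ch in unique_chars:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
```

Each vowel sign and the virama shows up as its own entry, which matches the observation that diacritics end up as separate symbols in the vocab.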
Thanks @felixdittrich92, noted. I am not sure right now, but I will look further into this.
:)
Hello @felixdittrich92, you are right: a sequence with diacritics like "ફુ્" is counted as several characters (3 code points, taking 9 bytes in UTF-8). To handle the diacritics so that such a sequence is treated as a single character, we can try NFC (Normalization Form C), which composes a character with its diacritics into a single code point where Unicode defines one, and does not otherwise change the actual encoding or byte representation.
For example:
import unicodedata

txt = "ફુ્"  # consonant + vowel sign + virama
encoded_string = txt.encode()  # UTF-8 by default
normalized_text = unicodedata.normalize("NFC", txt)
print("encoded string is:", encoded_string)
print(f"the length of encoded string is: {len(encoded_string)}")
print("normalized_text is:", normalized_text)
# len() on a str counts code points, not bytes
print(f"the length of normalized text is: {len(normalized_text)}")
Output:
encoded string is: b'\xe0\xaa\xab\xe0\xab\x81\xe0\xab\x8d'
the length of encoded string is: 9
normalized_text is: ફુ્
the length of normalized text is: 3
Please have a look at this. I do not know how other people have added diacritics; we could also add just the consonants and vowels, but that would not make any sense.
Let me know your thoughts.
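As the output above shows, the NFC result for this example still has length 3. One possible alternative, sketched here under assumptions, is to count a base character together with its combining marks as one unit. The helper `split_clusters` is hypothetical and is not part of this project; it is a simplification based on Unicode general categories, not the full grapheme cluster rules of UAX #29, so conjuncts such as consonant + virama + consonant are not merged.

```python
import unicodedata


def split_clusters(text: str) -> list[str]:
    """Group each base code point with the combining marks that follow it.

    Hypothetical helper: a simplification, not full UAX #29 segmentation.
    """
    clusters: list[str] = []
    for ch in text:
        # Categories Mn, Mc and Me are combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


txt = "\u0aab\u0ac1\u0acd"  # the "ફુ્" example from the comment above
nfc = unicodedata.normalize("NFC", txt)

print(len(txt), len(nfc))          # code point counts before and after NFC
print(len(split_clusters(txt)))    # number of base+marks units
print(split_clusters("\u0a95\u0acc\u0a95\u0abe"))  # કૌ and કા as two units
```

Whether such units could serve as vocab entries depends on how the encoder indexes the vocab string, which this thread does not settle.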