This repository has been archived by the owner on Jun 15, 2024. It is now read-only.

Did I hit a bug in gcld3? #64

Open
rvencu opened this issue Nov 3, 2021 · 0 comments


rvencu commented Nov 3, 2021

I ran a few experiments to see what detection returns for text that is not in any of the supported languages. It appears that whatever is detected in that case is flagged as unreliable, so I can reject the detection. However, I stumbled upon an example where the result is unexpectedly bad:

```python
import gcld3
detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)
sample = "The last part of this text is pure gibberish with well crafted punctuation. Този текст е на Български. Sdslkmnscd scsun dc mcsaducsdnmlmc icmmklmdsc!"
result = detector.FindTopNMostFreqLangs(text=sample, num_langs=5)
for i in result:
    print(i.language, i.is_reliable, i.proportion, i.probability)
```

Surprisingly, this outputs:
```
en True 0.4444444477558136 0.9999370574951172
bg True 0.28070175647735596 0.9173890948295593
hu True 0.27485379576683044 0.9084945917129517
und False 0.0 0.0
und False 0.0 0.0
```

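When the input is one part good text and one part garbage, the result depends on which part comes first and which has the larger proportion, but it can still be interpreted correctly. The example above, however, is quite bad: the gibberish at the end is apparently reported as Hungarian (`hu`) with `is_reliable` set to `True`.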
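
To make the problem concrete, here is a minimal sketch of the rejection step described above, applied to the values printed in the output. `Detection` and `reliable_languages` are hypothetical names used only for illustration. `Detection` is a plain stand-in that carries the four fields printed in the loop, so the snippet runs without gcld3 installed.

```python
from collections import namedtuple

# Hypothetical stand-in for the result objects; it only carries the four
# attributes printed in the loop above.
Detection = namedtuple("Detection", "language is_reliable proportion probability")


def reliable_languages(results):
    """Return the languages that survive a simple is_reliable filter."""
    return [
        r.language
        for r in results
        if r.language != "und" and r.is_reliable
    ]


# Values copied from the output above.
observed = [
    Detection("en", True, 0.4444444477558136, 0.9999370574951172),
    Detection("bg", True, 0.28070175647735596, 0.9173890948295593),
    Detection("hu", True, 0.27485379576683044, 0.9084945917129517),
    Detection("und", False, 0.0, 0.0),
    Detection("und", False, 0.0, 0.0),
]

print(reliable_languages(observed))
```

The `hu` entry passes the filter together with `en` and `bg`. Its probability is also within 0.01 of the `bg` entry, so a probability cutoff would not cleanly separate the two in this sample either.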
rvencu changed the title from "Did I hit a bug in gcld3" to "Did I hit a bug in gcld3?" on Nov 3, 2021