Did I hit a bug in gcld3? #64

rvencu · 2021-11-03T17:47:45Z

I ran a few experiments to see what the detector returns for text that is not in any of the supported languages. While it appears that such detections come back flagged as unreliable, so I can reject them, I stumbled upon an example where the result is unexpectedly bad:

```python
import gcld3

detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)
# English sentence, Bulgarian sentence, then pure gibberish.
sample = "The last part of this text is pure gibberish with well crafted punctuation. Този текст е на Български. Sdslkmnscd scsun dc mcsaducsdnmlmc icmmklmdsc!"
result = detector.FindTopNMostFreqLangs(text=sample, num_langs=5)
for i in result:
    print(i.language, i.is_reliable, i.proportion, i.probability)
```

Surprisingly, this outputs:

```
en True 0.4444444477558136 0.9999370574951172
bg True 0.28070175647735596 0.9173890948295593
hu True 0.27485379576683044 0.9084945917129517
und False 0.0 0.0
und False 0.0 0.0
```

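When the text is one part valid and one part garbage, the result depends on which part comes first and which has the larger proportion, but it can still be interpreted correctly. The example above, however, is quite bad: the gibberish tail is apparently reported as reliable Hungarian (`hu`) with a probability of about 0.91.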
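
To make the problem concrete, here is a minimal sketch of the rejection filter I had in mind, run against the numbers reported above. The `Detection` class and the `accepted` helper are hypothetical stand-ins (not part of gcld3) that only mirror the four printed fields, and the two cutoffs are arbitrary assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical stand-in mirroring the four fields printed above."""
    language: str
    is_reliable: bool
    proportion: float
    probability: float


# Values copied from the output reported in this issue (rounded).
reported = [
    Detection("en", True, 0.4444, 0.9999),
    Detection("bg", True, 0.2807, 0.9174),
    Detection("hu", True, 0.2749, 0.9085),
    Detection("und", False, 0.0, 0.0),
    Detection("und", False, 0.0, 0.0),
]


def accepted(results, min_probability):
    """Keep languages flagged reliable with probability >= min_probability."""
    return [
        r.language
        for r in results
        if r.is_reliable and r.probability >= min_probability
    ]


print(accepted(reported, 0.90))  # ['en', 'bg', 'hu']
print(accepted(reported, 0.95))  # ['en']
```

With these numbers, a cutoff of 0.90 keeps the spurious `hu`, while 0.95 also discards the correct `bg`. The two probabilities differ by less than 0.01, so filtering on `is_reliable` and `probability` alone does not separate them in any robust way.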

rvencu changed the title ~~Did I hit a bug in gcld3~~ Did I hit a bug in gcld3? Nov 3, 2021
