Feat: add language detection #33

axherrm · 2023-03-10T13:03:57Z

Using lingua, the following 2 values are set in the result map:

MOST_LIKELY_TEXT_LANGUAGE: the detected language, or UNKNOWN if the result is not reliable enough
TEXT_LANGUAGE_CONFIDENCE_VALUES: confidence values for the detected languages, in descending order, starting with the MOST_LIKELY_TEXT_LANGUAGE at a value of 1.0
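
To make the shape of these two entries concrete, here is a minimal sketch in plain Java. The class `LanguageResultSketch`, its method names, and the use of language names as `String` keys are assumptions for illustration only and are not taken from this PR; treating an empty input as UNKNOWN is likewise an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper (not from the PR): illustrates the two result-map entries.
final class LanguageResultSketch {
  static final String UNKNOWN = "UNKNOWN";

  /** Returns the confidence values ordered from highest to lowest. */
  static Map<String, Double> sortDescending(Map<String, Double> confidences) {
    return confidences.entrySet().stream()
        .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
        .collect(Collectors.toMap(
            Map.Entry::getKey,
            Map.Entry::getValue,
            (a, b) -> a,
            LinkedHashMap::new)); // LinkedHashMap keeps the sorted order
  }

  /** Builds a result map with both keys; an empty input yields UNKNOWN (assumption). */
  static Map<String, Object> toResult(Map<String, Double> confidences) {
    Map<String, Double> sorted = sortDescending(confidences);
    String mostLikely = sorted.isEmpty() ? UNKNOWN : sorted.keySet().iterator().next();
    Map<String, Object> result = new LinkedHashMap<>();
    result.put("MOST_LIKELY_TEXT_LANGUAGE", mostLikely);
    result.put("TEXT_LANGUAGE_CONFIDENCE_VALUES", sorted);
    return result;
  }
}
```

The first key of the sorted map is the most likely language, which is why the confidence list starts with it.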

The feature has to be enabled via the environment variable org.jadice.filetype.matchers.PDFMatcher.languageCheck.
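
A minimal sketch of how such an opt-in flag could be read. The class `LanguageCheckFlag` is hypothetical, and the assumption that the value `true` enables the feature is mine; the PR does not state which value is expected.

```java
import java.util.function.Function;

// Hypothetical helper (not from the PR): reads the opt-in flag for language detection.
final class LanguageCheckFlag {
  static final String KEY = "org.jadice.filetype.matchers.PDFMatcher.languageCheck";

  /** Looks up KEY via the given function; a missing value counts as disabled. */
  static boolean isEnabled(Function<String, String> lookup) {
    // Assumption: the string "true" (case-insensitive) switches the feature on.
    return Boolean.parseBoolean(lookup.apply(KEY));
  }

  /** Convenience overload that reads from the process environment. */
  static boolean isEnabled() {
    return isEnabled(System::getenv);
  }
}
```

Passing the lookup as a function keeps the check testable without touching the real environment.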
With the current configuration, all available languages (75) are considered. For performance reasons, though, it would make sense to restrict this where possible:

// include all languages available in the library
// WARNING: in the worst case this produces high memory 
//          consumption of approximately 3.5GB 
//          and slow runtime performance
//          (in high accuracy mode)
LanguageDetectorBuilder.fromAllLanguages()

// include only languages that are not yet extinct (= currently excludes Latin)
LanguageDetectorBuilder.fromAllSpokenLanguages()

// include only languages written with Cyrillic script
LanguageDetectorBuilder.fromAllLanguagesWithCyrillicScript()

// exclude only the Spanish language from the decision algorithm
LanguageDetectorBuilder.fromAllLanguagesWithout(Language.SPANISH)

// only decide between English and German
LanguageDetectorBuilder.fromLanguages(Language.ENGLISH, Language.GERMAN)

// select languages by ISO 639-1 code
LanguageDetectorBuilder.fromIsoCodes639_1(IsoCode639_1.EN, IsoCode639_1.DE)

// select languages by ISO 639-3 code
LanguageDetectorBuilder.fromIsoCodes639_3(IsoCode639_3.ENG, IsoCode639_3.DEU)

I have also set it up for now so that LanguageDetectorBuilder.withLowAccuracyMode() is used for texts of 120 characters or more, since that mode works with smaller data sets. According to the documentation, using it on texts shorter than 120 characters leads to too much inaccuracy.
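
The length rule described above can be sketched as follows. The class `AccuracyModeSketch` and its constant are hypothetical names, and treating a `null` text as "not long enough" is an assumption; the actual code in this PR may be structured differently.

```java
// Hypothetical helper (not from the PR): decides when low accuracy mode applies.
final class AccuracyModeSketch {
  // Threshold taken from the description above: 120 characters.
  static final int LOW_ACCURACY_MIN_LENGTH = 120;

  /** Returns true if the text is long enough for low accuracy mode. */
  static boolean useLowAccuracyMode(String text) {
    return text != null && text.length() >= LOW_ACCURACY_MIN_LENGTH;
  }
}
```

A caller would then call withLowAccuracyMode() on the builder only when this check returns true, and otherwise build the detector in its default mode.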

axherrm · 2023-03-10T13:29:50Z

Also, lingua really does seem to work completely offline: "It does not need a connection to any external API or service either. Once the library has been downloaded, it can be used completely offline."

axherrm force-pushed the feat/detect-language branch from e0ad60d to 15e0653 Compare March 10, 2023 14:03

Base automatically changed from feat/pdf-contains-text to master March 10, 2023 14:03

axherrm added 6 commits March 10, 2023 15:10

add draft how checking for text could look like

5258c29

feat(JF-466): add language recognition

f3ba2ba

undo language recognition to move to another branch

4ef0a68

add language detection

4d11dd4

fix test

2d1d072

fix test

868c7e1

axherrm force-pushed the feat/detect-language branch from 15e0653 to 868c7e1 Compare March 10, 2023 14:16

remove unused variable in @BeforeAll

4e4c577
