Feat: add language detection #33

Open: axherrm wants to merge 7 commits into master from feat/detect-language

axherrm (Contributor) commented Mar 10, 2023

Using lingua, the following two values are set in the result map:

  • MOST_LIKELY_TEXT_LANGUAGE: the detected language, or UNKNOWN if the result is not certain enough
  • TEXT_LANGUAGE_CONFIDENCE_VALUES: confidence values for the detected languages in descending order, starting with MOST_LIKELY_TEXT_LANGUAGE at 1.0
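
To make the shape of these two entries concrete, here is a minimal plain-Java sketch. It assumes the confidence values have already been computed (by lingua in this PR) and only shows the ordering and the choice of the top entry. The class name, the method name, and the use of plain strings as map keys are hypothetical, and the UNKNOWN case is simplified to "no candidates at all"; the actual certainty criterion used by the PR is not shown here.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the PR: illustrates the two result map entries.
final class LanguageResultSketch {
    // Assumption: the entries are keyed by these names as plain strings.
    static final String MOST_LIKELY_KEY = "MOST_LIKELY_TEXT_LANGUAGE";
    static final String CONFIDENCE_KEY = "TEXT_LANGUAGE_CONFIDENCE_VALUES";

    static Map<String, Object> toResultEntries(Map<String, Double> confidences) {
        // Sort by confidence, highest first; LinkedHashMap keeps that order.
        Map<String, Double> ordered = new LinkedHashMap<>();
        confidences.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
            .forEach(e -> ordered.put(e.getKey(), e.getValue()));

        // Simplification: UNKNOWN only when there is no candidate at all.
        String mostLikely = ordered.isEmpty()
            ? "UNKNOWN"
            : ordered.keySet().iterator().next();

        Map<String, Object> result = new LinkedHashMap<>();
        result.put(MOST_LIKELY_KEY, mostLikely);
        result.put(CONFIDENCE_KEY, ordered);
        return result;
    }
}
```

A caller would then read MOST_LIKELY_TEXT_LANGUAGE for a quick answer and fall back to the ordered confidence values when it needs the runner-up languages.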

The feature must be enabled via the environment variable org.jadice.filetype.matchers.PDFMatcher.languageCheck.
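
As a rough illustration of such an opt-in switch, the following sketch reads the flag and treats only "true" (case-insensitive) as enabled. How the PR actually reads the setting is not shown in this description: the class and method names are hypothetical, the accepted value "true" is an assumption, and the fallback to a JVM system property of the same name is an assumption added because the name looks like a Java property key.

```java
// Hypothetical helper, not part of the PR: reads the opt-in flag for the language check.
final class LanguageCheckFlag {
    static final String FLAG_NAME =
        "org.jadice.filetype.matchers.PDFMatcher.languageCheck";

    // Assumption: only the value "true" (any case, surrounding spaces ignored) enables the feature.
    static boolean parseFlag(String raw) {
        return raw != null && Boolean.parseBoolean(raw.trim());
    }

    static boolean isLanguageCheckEnabled() {
        // Environment variable first, as described in the PR text.
        String fromEnv = System.getenv(FLAG_NAME);
        // Assumption: fall back to a system property with the same name.
        String raw = fromEnv != null ? fromEnv : System.getProperty(FLAG_NAME);
        return parseFlag(raw);
    }
}
```

With the fallback in place, starting the JVM with -Dorg.jadice.filetype.matchers.PDFMatcher.languageCheck=true would enable the check in this sketch; leaving the flag unset keeps it off.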
With the current configuration, all available languages (75) are considered. For performance reasons it would make sense to restrict this set, if possible:

// include all languages available in the library
// WARNING: in the worst case this produces high memory 
//          consumption of approximately 3.5GB 
//          and slow runtime performance
//          (in high accuracy mode)
LanguageDetectorBuilder.fromAllLanguages()

// include only languages that are not yet extinct (= currently excludes Latin)
LanguageDetectorBuilder.fromAllSpokenLanguages()

// include only languages written with Cyrillic script
LanguageDetectorBuilder.fromAllLanguagesWithCyrillicScript()

// exclude only the Spanish language from the decision algorithm
LanguageDetectorBuilder.fromAllLanguagesWithout(Language.SPANISH)

// only decide between English and German
LanguageDetectorBuilder.fromLanguages(Language.ENGLISH, Language.GERMAN)

// select languages by ISO 639-1 code
LanguageDetectorBuilder.fromIsoCodes639_1(IsoCode639_1.EN, IsoCode639_1.DE)

// select languages by ISO 639-3 code
LanguageDetectorBuilder.fromIsoCodes639_3(IsoCode639_3.ENG, IsoCode639_3.DEU)

In addition, I have set it up for now so that LanguageDetectorBuilder.withLowAccuracyMode() is used for texts of 120 characters or more, since low accuracy mode works with smaller data sets. According to the documentation, using it below 120 characters would make the results too inaccurate.
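
The length rule described above can be sketched in plain Java as follows. This is an illustration of the threshold only, not the PR's code: the enum, the constant, and the method name are hypothetical, and the comment marks where a lingua builder would be configured in a real implementation.

```java
// Hypothetical helper, not part of the PR: picks the accuracy mode by text length.
final class AccuracyModeSketch {
    enum AccuracyMode { HIGH, LOW }

    // Threshold taken from the PR description: 120 characters or more uses low accuracy mode.
    static final int LOW_ACCURACY_MIN_LENGTH = 120;

    static AccuracyMode chooseMode(String text) {
        // String.length() counts UTF-16 code units, which is an approximation of "characters".
        return text.length() >= LOW_ACCURACY_MIN_LENGTH
            ? AccuracyMode.LOW   // here a real implementation would call withLowAccuracyMode() on the builder
            : AccuracyMode.HIGH; // short texts keep the default (high accuracy) mode
    }
}
```

Keeping the decision in one small function makes the threshold easy to adjust later, for example if measurements show that a different cutoff gives a better balance between memory use and accuracy.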


axherrm commented Mar 10, 2023

axherrm force-pushed the feat/detect-language branch from e0ad60d to 15e0653 on March 10, 2023 14:03
Base automatically changed from feat/pdf-contains-text to master March 10, 2023 14:03
axherrm force-pushed the feat/detect-language branch from 15e0653 to 868c7e1 on March 10, 2023 14:16