
GermanLexicalResources

brittazeller edited this page Jun 20, 2013 · 6 revisions

This page documents the German lexical resource modules (those that implement the LexicalResource interface) within the EXCITEMENT Open Platform (EOP).

@TODO (update the document, Britta & Gil)


List of German Lexical Resources

As of release 1.0, the EOP core contains three German lexical resources:

- DEWakDistributional: A resource based on distributional similarities observed in the DeWac corpus. It holds the 10k most frequent terms and their pairwise similarities, and returns lexical rules based on those similarities.
- DerivBase: A resource that groups derivationally related words across parts of speech, and returns lexical rules based on these groups.
- GermaNetWrapper: An implementation that interacts with GermaNet to generate lexical rules. GermaNet itself is not provided; the user has to install it separately.

DEWakDistributional (core.component.lexicalknowledge.dewakdistributional.GermanDistSim)

Introduction

This resource implements a German lexical resource based on corpus term distribution. It uses distance vectors gathered from DeWac, a web corpus for German. The vectors are based on the 10k most frequent words observed in the corpus. Similarity is calculated with five different similarity measures (balAPinc, lin, linOpt, jaccard, dice). Only pairs that reach a predefined minimum similarity are stored in the resource (balAPinc: 0.7, lin: 0.6, linOpt: 0.6, jaccard: 0.8, dice: 0.9). As a confidence score, the resource returns the distributional similarity score calculated for the lemma-POS pair. Depending on the measure used, the score therefore lies between that measure's threshold (0.6 at the lowest) and 1.0. DEWakDistributional is a simple LexicalResource and does not support LexicalResourceWithRelation.
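The storage rule described above can be sketched as a simple per-measure threshold check. This is only an illustration: the class `SimilarityThresholds` and the method `isStored` are hypothetical names and not part of the EOP code base, and treating the threshold as inclusive is an assumption. Only the measure names and threshold values come from the text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not taken from the EOP code base.
final class SimilarityThresholds {

    // Minimum similarity per measure, as listed in the text above.
    static final Map<String, Double> MIN_SIMILARITY = new LinkedHashMap<>();
    static {
        MIN_SIMILARITY.put("balAPinc", 0.7);
        MIN_SIMILARITY.put("lin", 0.6);
        MIN_SIMILARITY.put("linOpt", 0.6);
        MIN_SIMILARITY.put("jaccard", 0.8);
        MIN_SIMILARITY.put("dice", 0.9);
    }

    // Returns true if a pair with this similarity would be kept for the measure.
    // Assumption: the threshold is inclusive (score >= minimum).
    static boolean isStored(String measure, double similarity) {
        Double min = MIN_SIMILARITY.get(measure);
        if (min == null) {
            throw new IllegalArgumentException("Unknown measure: " + measure);
        }
        return similarity >= min;
    }

    public static void main(String[] args) {
        // The same score can pass one measure's threshold and fail another's.
        System.out.println("lin  0.65 stored: " + isStored("lin", 0.65));
        System.out.println("dice 0.65 stored: " + isStored("dice", 0.65));
    }
}
```

Because the thresholds differ per measure, the confidence returned for a rule is only comparable to other scores from the same measure.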

Configurable values

No values to configure.

DerivBase (core.component.lexicalknowledge.derivbase.DerivBaseResource)

Introduction

This resource implements a German lexical resource based on derivational information, DErivBase v1.3. The resource contains groups of lemmas, so-called derivational families, which share a morphological (and ideally also a semantic) relationship, e.g. "sleep, sleepy, to sleep, sleepless". DErivBase has been generated by a rule-based approach: content words from the SdeWaC corpus are grouped into derivational families with the help of manually written derivation rules. For Textual Entailment, we assume a bidirectional entailment relationship between two words that occur in the same derivational family.

The resource is accessed via the class core.component.lexicalknowledge.derivbase.DerivBase. It loads information from one of two existing files: one contains only the derivational families; the other contains both the derivational families and confidence scores for all pairs of lemmas within one family. The confidence score within DErivBase reflects the connectedness of two lemmas within one family: the score is calculated as 1/n, where n is the length of the derivation path. Thus, a score of 1.0 is assigned to pairs that are directly linked by a single rule, 0.5 to pairs linked by two rules, 0.33 to pairs linked by three rules, and so on.

We transfer the DErivBase-internal confidence scores to the confidence scores as defined in the EXCITEMENT project: a confidence score of 0 means "no entailment", 0.5 means "don't know", and 1.0 means "entailment". Since we assume that derivationally related lemmas also entail each other (bidirectionally), we map the scores from DErivBase into the range 0.5 to 1.0. The following are some examples of DErivBase-internal and corresponding EXCITEMENT confidence scores:

DErivBase = 1.0; EXCITEMENT = 1.0

DErivBase = 0.5; EXCITEMENT = 0.75

DErivBase = 0.33; EXCITEMENT = 0.665 etc.
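The three examples above are consistent with a linear mapping of the form 0.5 + s/2, where s is the DErivBase-internal score. The sketch below illustrates that reading together with the 1/n path score. The formula is inferred from the examples rather than taken from the implementation, and the class and method names (`DerivBaseScores`, `internalScore`, `toExcitementScore`) are hypothetical.

```java
// Hypothetical helper; names are illustrative and not taken from the EOP code base.
final class DerivBaseScores {

    // DErivBase-internal score: 1/n for a derivation path of length n (n >= 1).
    static double internalScore(int pathLength) {
        if (pathLength < 1) {
            throw new IllegalArgumentException("Path length must be at least 1");
        }
        return 1.0 / pathLength;
    }

    // Linear map from [0, 1] to [0.5, 1.0].
    // Assumption: inferred from the examples (1.0 -> 1.0, 0.5 -> 0.75, 0.33 -> 0.665).
    static double toExcitementScore(double internal) {
        if (internal < 0.0 || internal > 1.0) {
            throw new IllegalArgumentException("Internal score must be in [0, 1]");
        }
        return 0.5 + internal / 2.0;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            double internal = internalScore(n);
            System.out.println("n=" + n
                    + " internal=" + internal
                    + " excitement=" + toExcitementScore(internal));
        }
    }
}
```

Note that with the unrounded value 1/3 the mapped score is about 0.667; the 0.665 in the list above results from mapping the rounded value 0.33.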

DerivBaseResource is a simple LexicalResource and does not support LexicalResourceWithRelation.

Configurable values

DerivBaseResource has two configurable values, which indicate whether confidence scores for derivationally related lemma pairs in the resource should be used, and in which way.

| Section | Property | Value | Requirement |
| --- | --- | --- | --- |
| DerivBaseResource | useScores | Boolean value. Specifies if rule confidence scores should be used (true) or not (false). | N/A |
| DerivBaseResource | derivationSteps | Integer value between 1 and 10. Specifies how many derivation steps are accepted to derive lemma l2 from lemma l1, and to still count this lemma pair as "entailment". Thus, this value influences how many lemma pairs of the DErivBase resource are considered in the EXCITEMENT platform. | Only effective if useScores is set to true. |
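One way to read the interaction of the two properties is sketched below: a lemma pair is accepted if the number of rules linking the two lemmas does not exceed derivationSteps, and the limit is ignored when useScores is false. The class `DerivationStepFilter` is hypothetical, and accepting every pair of a family when useScores is false is an assumption; how DerivBaseResource applies these values internally is not specified on this page.

```java
// Hypothetical illustration of the two properties; not the EOP implementation.
final class DerivationStepFilter {

    private final boolean useScores;
    private final int derivationSteps;

    DerivationStepFilter(boolean useScores, int derivationSteps) {
        // The documented range for derivationSteps is 1 to 10.
        if (derivationSteps < 1 || derivationSteps > 10) {
            throw new IllegalArgumentException("derivationSteps must be between 1 and 10");
        }
        this.useScores = useScores;
        this.derivationSteps = derivationSteps;
    }

    // pathLength is the number of derivation rules linking the two lemmas.
    boolean accepts(int pathLength) {
        if (pathLength < 1) {
            throw new IllegalArgumentException("Path length must be at least 1");
        }
        if (!useScores) {
            // Assumption: without scores, every pair in a family is accepted.
            return true;
        }
        return pathLength <= derivationSteps;
    }

    public static void main(String[] args) {
        DerivationStepFilter filter = new DerivationStepFilter(true, 2);
        System.out.println("2 steps accepted: " + filter.accepts(2));
        System.out.println("3 steps accepted: " + filter.accepts(3));
    }
}
```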

GermaNetWrapper (core.component.lexicalknowledge.germanet.GermaNetWrapper)

Introduction

This class implements a German lexical resource based on GermaNet 7.0, the German WordNet. It accesses GermaNet via the GermaNet API and supports both LexicalResource and LexicalResourceWithRelation. For relations, it supports both OwnRelationSpecifier (with GermaNetRelation; possible relation types: synonym, hypernym, hyponym, antonym, causes, entails) and CanonicalRelationSpecifier (possible relation types: TERuleRelation.Entailment or .Nonentailment). A confidence score can be set for each own relation in the configuration. If a configuration is used but the scores are not defined in it, the confidences of all relations default to 0.0. If no configuration is used, the confidences of all relations default to 1.0.
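The two default cases can be easy to mix up, so here is a minimal sketch of the lookup logic. It is not the GermaNetWrapper code: the class `RelationConfidenceDefaults` and its method are hypothetical, a `null` map stands in for "no configuration used", and applying the 0.0 default per missing relation is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the default rules; not the GermaNetWrapper code.
final class RelationConfidenceDefaults {

    static final double DEFAULT_WITHOUT_CONFIG = 1.0;
    static final double DEFAULT_MISSING_IN_CONFIG = 0.0;

    // configuredScores == null models the case that no configuration is used.
    static double confidenceFor(String relation, Map<String, Double> configuredScores) {
        if (configuredScores == null) {
            return DEFAULT_WITHOUT_CONFIG;
        }
        Double score = configuredScores.get(relation);
        // Assumption: a relation missing from the configuration falls back to 0.0.
        return score != null ? score : DEFAULT_MISSING_IN_CONFIG;
    }

    public static void main(String[] args) {
        Map<String, Double> config = new HashMap<>();
        config.put("synonym", 0.9);

        System.out.println("synonym, configured:  " + confidenceFor("synonym", config));
        System.out.println("hypernym, not in cfg: " + confidenceFor("hypernym", config));
        System.out.println("hypernym, no config:  " + confidenceFor("hypernym", null));
    }
}
```

In practice this means that a configuration which omits the scores effectively switches the relations off, so the scores should be set explicitly whenever a configuration is supplied.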

Note: The EXCITEMENT project cannot and does not redistribute GermaNet. Users of this component must obtain it under a proper license agreement from Tuebingen University. The GermaNet API, however, is provided with the project.

Configurable values

The GermaNet resource has a few configurable values. It needs the path to the GermaNet data itself, and a set of double values that indicate the "confidence" of each own relation when it is treated as "entailment".

| Section | Property | Value | Requirement |
| --- | --- | --- | --- |
| GermaNetWrapper | germaNetFilesPath | Path to the GermaNet resource, which has to be installed locally by the user. | N/A |
| GermaNetWrapper | causesConfidence | Confidence score indicating how reliable the GermaNet 'causes' relation is considered. Value between 0 and 1. Causes are only used for rules LHS -> RHS. | N/A |
| GermaNetWrapper | entailsConfidence | Confidence score indicating how reliable the GermaNet 'entails' relation is considered. Value between 0 and 1. Entails are only used for rules LHS -> RHS. | N/A |
| GermaNetWrapper | hypernymConfidence | Confidence score indicating how reliable the GermaNet 'hypernym' relation is considered. Value between 0 and 1. Hypernyms are only used for rules LHS -> RHS. | N/A |
| GermaNetWrapper | synonymConfidence | Confidence score indicating how reliable the GermaNet 'synonym' relation is considered. Value between 0 and 1. Synonyms are used for both rules RHS -> LHS and LHS -> RHS. | N/A |
| GermaNetWrapper | hyponymConfidence | Confidence score indicating how reliable the GermaNet 'hyponym' relation is considered. Value between 0 and 1. Hyponyms are only used for rules RHS -> LHS. | N/A |
| GermaNetWrapper | antonymConfidence | Confidence score indicating how reliable the GermaNet 'antonym' relation is considered. Value between 0 and 1. DEPRECATED: This relation is deprecated, do not use it. | N/A |