rzanoli edited this page Jun 25, 2013 · 23 revisions

The edit distance EDA casts textual entailment as the problem of mapping the whole content of the hypothesis (H) into the content of the text (T). A mapping is a sequence of edit operations (insertion, deletion, and substitution of text portions) needed to transform T into H, where each edit operation has an associated cost. EditDistanceEDA uses the distances calculated by the distance components to predict entailment/non-entailment relations for T-H pairs. In the current release of the platform, these distance components are available:

• FixedWeightTokenEditDistance: a token-based version of the Levenshtein distance algorithm, with edit operations defined over sequences of tokens of T and H.

• FixedWeightLemmaEditDistance: a token-based version of the Levenshtein distance algorithm, with edit operations defined over sequences of lemmas of tokens of T and H.
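To make the idea of a fixed-weight, token-level edit distance concrete, here is a minimal sketch in Java. It is not the EOP implementation: the class name `WeightedTokenEditDistance`, its constructor, and the weight values used in `main` are assumptions made for illustration only. It computes the cost of the cheapest sequence of match, delete, insert, and substitute operations that transforms the tokens of T into the tokens of H. Passing lemmas instead of surface tokens would give the lemma-based variant.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; not the EOP FixedWeightTokenEditDistance class. */
public class WeightedTokenEditDistance {
    private final double match;
    private final double delete;
    private final double insert;
    private final double substitute;

    public WeightedTokenEditDistance(double match, double delete,
                                     double insert, double substitute) {
        this.match = match;
        this.delete = delete;
        this.insert = insert;
        this.substitute = substitute;
    }

    /** Cost of the cheapest edit sequence that transforms the tokens of T into the tokens of H. */
    public double distance(List<String> t, List<String> h) {
        double[][] d = new double[t.size() + 1][h.size() + 1];
        // Transforming a prefix of T into the empty sequence: delete every token.
        for (int i = 1; i <= t.size(); i++) {
            d[i][0] = d[i - 1][0] + delete;
        }
        // Transforming the empty sequence into a prefix of H: insert every token.
        for (int j = 1; j <= h.size(); j++) {
            d[0][j] = d[0][j - 1] + insert;
        }
        for (int i = 1; i <= t.size(); i++) {
            for (int j = 1; j <= h.size(); j++) {
                boolean same = t.get(i - 1).equals(h.get(j - 1));
                double diagonal = d[i - 1][j - 1] + (same ? match : substitute);
                double deletion = d[i - 1][j] + delete;
                double insertion = d[i][j - 1] + insert;
                d[i][j] = Math.min(diagonal, Math.min(deletion, insertion));
            }
        }
        return d[t.size()][h.size()];
    }

    public static void main(String[] args) {
        // Hypothetical weights: matches are free, every other operation costs 1.
        WeightedTokenEditDistance dist = new WeightedTokenEditDistance(0.0, 1.0, 1.0, 1.0);
        List<String> t = Arrays.asList("the", "cat", "sat", "on", "the", "mat");
        List<String> h = Arrays.asList("the", "cat", "sat", "on", "a", "mat");
        System.out.println(dist.distance(t, h)); // prints 1.0 (one substitution)
    }
}
```

With these weights the result is the classic Levenshtein distance over tokens. Changing the four weights changes which transformations of T into H count as cheap.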

Running EditDistanceEDA should not require additional installation or building steps apart from setting up the EOP. The remainder of this document describes the possible configurations for EditDistanceEDA.

Configuration File

We provide 3 configuration files located under /core/src/main/resources/configuration-file/:

• EditDistanceEDA_DE.xml (German language)

• EditDistanceEDA_EN.xml (English language)

• EditDistanceEDA_IT.xml (Italian language)

There is one file for each of the three supported languages (English, German and Italian), and each file contains different instances of the algorithm that can be tested. The structure and values of these configuration files are explained in the table below.

Common settings

| Section | Property | Value | Requirement |
| --- | --- | --- | --- |
| PlatformConfiguration | activatedEDA | The common setting for selecting the EDA. The default value here is eu.excitementproject.eop.core.EditDistanceEDA. | N/A |
| PlatformConfiguration | language | For the moment, EditDistanceEDA supports English (EN), German (DE), and Italian (IT). In principle, the EDA is language-independent. | N/A |
| PlatformConfiguration | activatedLAP | The linguistic analysis pipeline needed to produce input for the EDA. | N/A |
| eu.excitementproject.eop.core.EditDistanceEDA | modelFile | The location where the trained model is stored. The default location is under core/src/main/resources/model/. Model names follow an informative convention: they include the name of the EDA used to produce them, the language, and additional information about the settings used. | For training, the model file should NOT exist. |
| eu.excitementproject.eop.core.EditDistanceEDA | trainDir | The directory containing the training data, as produced by the LAP (in xmi format). | The directory should exist. |
| eu.excitementproject.eop.core.EditDistanceEDA | testDir | The directory containing the test data, as produced by the LAP (in xmi format). | The directory should exist. |
| eu.excitementproject.eop.core.EditDistanceEDA | components | The component used by EditDistanceEDA for distance computations. A component may itself require additional parameters, which are specified in a section of its own. That section is identified by the component name given as the value of this property. At present these components are available: 1. FixedWeightTokenEditDistance, 2. Green, 3. FixedWeightLemmaEditDistance | N/A |
| eu.excitementproject.eop.core.component.distance.FixedWeightTokenEditDistance | instances | The component computes the distance between two strings using the tokens and fixed weights for the string edit operations. The instances value names a subsection that contains the parameters needed to use this component. | |
| eu.excitementproject.eop.core.component.distance.FixedWeightLemmaEditDistance | instances | The component computes the distance between two strings using the lemmas and fixed weights for the string edit operations. The instances value names a subsection that contains the parameters needed to use this component. To use this component, the LAP must provide token and lemma annotations (currently only TreeTagger provides these for all three languages, and TextPro for Italian). | |
| basic / wordnet | stopWordRemoval | Either true or false; tells the distance computation component whether to filter out stop words. | |
| wordnet | path | The path to the particular WordNet resource used. The English WordNet is freely distributed and is included in the release. The Italian WordNet is also free but must be requested from FBK; details are provided in the documentation for the Italian knowledge resources. GermaNet is proprietary; details about the resource and how to obtain it are provided in the documentation for the German knowledge resources. | |
| weights | match/delete/insert/substitute | Real-valued weights for each string edit operation, used by the distance computation component. | |
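As a rough illustration of what the stopWordRemoval setting implies, the sketch below filters stop words out of a token sequence before any distance is computed. It is an assumption-laden example, not EOP code: the class `StopWordFilter`, the tiny stop-word list, and the case-insensitive comparison are all hypothetical choices made for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative sketch only; not part of the EOP code base. */
public class StopWordFilter {
    // Hypothetical stop-word list; a real list depends on the language and resource used.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "on", "of"));

    /** Returns the tokens unchanged if stopWordRemoval is false, otherwise without stop words. */
    public static List<String> filter(List<String> tokens, boolean stopWordRemoval) {
        if (!stopWordRemoval) {
            return tokens;
        }
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            // Assumption: comparison ignores case.
            if (!STOP_WORDS.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> t = Arrays.asList("The", "cat", "sat", "on", "the", "mat");
        System.out.println(filter(t, true));  // prints [cat, sat, mat]
        System.out.println(filter(t, false)); // prints [The, cat, sat, on, the, mat]
    }
}
```

Applying such a filter to both T and H before computing the distance means that edits involving only stop words no longer contribute to the cost.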

Specific language settings

To adjust the distance computation for a specific language, use the wordnet value for the instances property of the distance computation component, and give the path to the desired resource in the corresponding subsection of the configuration file, as described above.