Text Reading: Extraction Framework

Extraction Framework

We extract entities and events from free text using the rule-based event extraction framework Odin (for details, see the associated GitHub wiki and the user manual). Extracted entities and events are represented as Mention objects that contain all the essential information about the entity or event, including the text of the mention, the trigger⁺ (if present), the arguments, and more. The number of arguments in each mention depends on the type of event extracted. Explanations of the entities and events currently extracted can be found on this page. We apply this extraction framework to reading the text of scientific publications, source comments, and model documents (i.e., READMEs).
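
To make the shape of a mention more concrete, here is a minimal sketch in plain Scala. `SimpleMention` is a hypothetical, simplified stand-in and not one of Odin's actual mention classes; it only mirrors the pieces named above (label, text, optional trigger, and named arguments).

```scala
// Simplified, hypothetical stand-in for Odin's mention types (not the real classes).
object MentionSketch {
  final case class SimpleMention(
    label: String,                                          // e.g. "Definition" or "Concept"
    text: String,                                           // the text span covered by the mention
    trigger: Option[String] = None,                         // only events have a trigger
    arguments: Map[String, Seq[SimpleMention]] = Map.empty  // argument name -> filler mentions
  )

  // Two entity-like mentions with no trigger and no arguments.
  val variable: SimpleMention = SimpleMention("Concept", "LAI")
  val definition: SimpleMention = SimpleMention("Concept", "leaf area index")

  // An event-like mention for the sentence "LAI is the actual leaf area index".
  val event: SimpleMention = SimpleMention(
    label = "Definition",
    text = "LAI is the actual leaf area index",
    trigger = Some("is"),
    arguments = Map("variable" -> Seq(variable), "definition" -> Seq(definition))
  )
}
```

In this sketch an entity is simply a mention without a trigger or arguments, while an event carries both.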

The extraction framework includes the following components:

- Engines
- Odin grammar rules
- Odin Actions
- Expansion handler
- Attachments
- Serialization

Each of these components is explained in detail in the following sections.

⁺ A trigger is a word or a phrase that is used in a rule-based system to signal the presence of an event of interest.

Engines

There are three types of engines within the extraction framework: TextEngine, MarkdownEngine, and CommentEngine.

See Text Reading: Setup for more information on getting started with a particular engine.

TextEngine

The TextEngine contains rules used for extracting information from scientific publications. The rules match:

Entities
- Locations
- Dates
- Units
- Models
- Filenames
- Repositories
Events

MarkdownEngine

The MarkdownEngine contains rules used to extract commands, and information about them, from model documents (i.e., READMEs). The rules match:

Entities
- Filenames
- Repositories
- Commands
- Command Line Parameters
Events

CommentEngine

The CommentEngine contains rules used to extract information from source comments. The rules match:

Identifiers
- Descriptions
- Units

Odin grammar rules

The rule-writing framework is flexible. It allows the user to extract events with multiple arguments, use dependency graphs or surface patterns, add constraints to extractions using regular expressions, incorporate previously extracted events, and more. As a gentle introduction, here we provide two examples of the rules we use in our reading system (see all the rules here).

An example of a dependency-based rule:

  - name: var_cop_definition                        #has to be unique
    label: Definition                               #the event label
    priority: Int                                   #shows the order in which rules (or rule batches) are going to be applied 
    type: dependency                                 #rule based on a dependency graph
    example: "LAI is the actual leaf area index"
    action: ${action}                                #an action (or a heuristic) to apply after the event is extracted with a rule to filter out undesired output 
    pattern: |
      trigger = [lemma="be"]                         #a word/lemma/pos-tag that is indicative of the event 
      variable:Concept = (<cop /${agents}/ | <cop <dep appos ) [!entity = /NUMBER|B-unit/ & !word = "=" & !word = ","] #<argument 1>:<the type of entity the argument can be>; the argument can be reached through the dependency relation paths 
      definition: Concept = <cop (?! case) nmod_for? compound? [!entity = /NUMBER|B-unit/] #argument 2

The rule extracts a Definition event with two arguments: variable (here LAI) and definition (here leaf area index).

An example of a token-based rule:

  - name: var_definition
    label: Definition
    priority: Int
    type: token
    example: "EEQ Equilibrium evaporation (mm/d)"
    pattern: |
       @variable:Variable (?<definition> [word = /.*/ & !tag="-LRB-"]+)

This rule is used to extract events from source code comments. Here, we extract a Definition event that includes two arguments: a previously extracted ¹ Variable (here EEQ), which will be the variable argument, and a previously unseen ² definition (here Equilibrium evaporation).

¹ following the pattern: @<arg_name>:<Label_of_previously_found_Mention>

² the syntax to assign a label to a pattern: (?<label> pattern)
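
The logic of the token-based rule can be approximated in plain Scala. This is only a sketch of the idea, not how Odin evaluates patterns: `Token` and `definitionAfter` are hypothetical names, and the part-of-speech tags in the sample are illustrative. Starting right after a previously found variable, it collects one or more tokens until it hits a left parenthesis (tag `-LRB-`).

```scala
object TokenRuleSketch {
  // Hypothetical simplified token: the word plus its part-of-speech tag.
  final case class Token(word: String, tag: String)

  // Given the index of a previously found variable token, collect the run of
  // following tokens up to (not including) the first token tagged "-LRB-".
  def definitionAfter(tokens: Seq[Token], variableIndex: Int): Option[String] = {
    val words = tokens.drop(variableIndex + 1).takeWhile(_.tag != "-LRB-").map(_.word)
    if (words.isEmpty) None else Some(words.mkString(" ")) // the pattern requires at least one token
  }

  // Tokens for the example comment "EEQ Equilibrium evaporation (mm/d)".
  val tokens: Seq[Token] = Seq(
    Token("EEQ", "NN"), Token("Equilibrium", "NN"), Token("evaporation", "NN"),
    Token("(", "-LRB-"), Token("mm/d", "NN"), Token(")", "-RRB-")
  )

  // The variable EEQ sits at index 0, so the definition is "Equilibrium evaporation".
  val definition: Option[String] = definitionAfter(tokens, 0)
}
```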

Rule priority

Rules are not applied simultaneously; they are applied in a fixed order. This order (or priority) can be set either directly in each rule or in the master file, taking the interactions among the rules into account. The places where rule priority can be set are shown below.

Priority setting within each rule

 - name: duration
   label: Unit
   priority: ${priority} # the priority can be set directly as a number, or as a template variable (${priority}) whose value comes from the master file (example shown below)
   type: token
   example: "It's been 7-10 days."
   pattern: |
     (?<= [tag = "CD"]) [entity="DURATION" & !chunk = "O"]

Priority setting in master file

  - import: "org/clulab/aske_automates/grammars/units.yml"
    vars:
      priority: 7 # the rules in the units.yml file are applied with priority 7 (seventh in the order)
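
Why the order matters can be shown with a small, self-contained sketch. It is a deliberate simplification under stated assumptions and not Odin's actual scheduling: `Rule` and `run` are hypothetical, and each rule is reduced to a function from the labels found so far to newly found labels.

```scala
object PrioritySketch {
  // Hypothetical simplified rule: a name, a priority, and a function that
  // produces new labels from the labels found so far.
  final case class Rule(name: String, priority: Int, extract: Set[String] => Set[String])

  // Apply rules in ascending priority order; each rule sees earlier results.
  def run(rules: Seq[Rule]): Set[String] =
    rules.sortBy(_.priority).foldLeft(Set.empty[String]) { (found, rule) =>
      found ++ rule.extract(found)
    }

  val rules: Seq[Rule] = Seq(
    // Listed first, but runs second: it needs a Variable to have been found already.
    Rule("var_definition", 2, found => if (found.contains("Variable")) Set("Definition") else Set.empty),
    Rule("variable", 1, _ => Set("Variable"))
  )

  // Contains both "Variable" and "Definition" because "variable" runs first.
  val result: Set[String] = run(rules)
}
```

If the two priorities were swapped, the definition rule would run before any Variable exists and would find nothing, which is the kind of interaction to keep in mind when assigning priorities.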

Odin Actions

Odin Actions are Scala methods used to refine the mentions extracted by the Odin grammar rules. These actions can modify the mention structures, filter out erroneous or redundant mentions, add attachments to mentions, or create new mentions. A simple action is shown below.

Odin Action example

  import org.clulab.odin.{Mention, State}

  def filterTooLongParam(mentions: Seq[Mention], state: State = new State()): Seq[Mention] = {
    // This action filters out any mentions that are too long.
    // The action's name, its parameters (mentions, state), and its return type are specified as above.
    val filteredOutput = mentions.filter(m => m.text.length < 40) // keep mentions whose text is shorter than 40 characters
    filteredOutput // returned output
  }

These actions can be applied in multiple places: directly in each rule, in the master file, or in the OdinEngine file (within the extractFrom method).

To see all the actions currently used, follow the link here. To learn more about Odin Actions, refer to Section 6 ("Advanced: Customizing Rule Output with Actions") in the user manual.

Expansion handler

The expansion handler is used to complete incomplete mention arguments. It allows a mention argument to be expanded if it is linked to other tokens through a "valid" dependency relation. Which dependency relations count as valid can be specified in the ExpansionHandler object using regular expressions. The example below shows how the expansion works.

Before the expansion

After the expansion

In the example above, the output argument "severity" is expanded to "severity of the pandemic", as the tokens "severity" and "pandemic" are linked through a valid dependency relation (nmod_of).
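
The same idea can be sketched in a few lines of self-contained Scala. This is a rough approximation and not the logic of the actual ExpansionHandler: `Edge` and `expand` are hypothetical, the dependency edges are written by hand for the phrase "the severity of the pandemic", and validity is reduced to a single regular expression over relation names.

```scala
object ExpansionSketch {
  // Hypothetical simplified dependency edge: head token index -> dependent token index.
  final case class Edge(head: Int, dependent: Int, relation: String)

  // Follow outgoing edges whose relation matches the regex `valid`, starting from
  // `start`, then return the contiguous token interval covering everything reached.
  def expand(start: Int, edges: Seq[Edge], valid: String): Range = {
    @annotation.tailrec
    def reach(frontier: Set[Int], seen: Set[Int]): Set[Int] = {
      val next = edges.collect {
        case Edge(h, d, rel) if frontier(h) && rel.matches(valid) && !seen(d) => d
      }.toSet
      if (next.isEmpty) seen else reach(next, seen ++ next)
    }
    val covered = reach(Set(start), Set(start))
    covered.min to covered.max
  }

  val tokens: Vector[String] = Vector("the", "severity", "of", "the", "pandemic")
  val edges: Seq[Edge] = Seq(
    Edge(1, 0, "det"), Edge(1, 4, "nmod_of"), Edge(4, 2, "case"), Edge(4, 3, "det")
  )

  // Only nmod_of is treated as valid here, so the leading "the" is not absorbed.
  val span: Range = expand(1, edges, "nmod_of")
  val expanded: String = span.map(tokens).mkString(" ") // "severity of the pandemic"
}
```

Changing the regular expression changes how far an argument grows, which is why different mention types can use different sets of valid relations.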

The set of valid dependency relations differs among the mention types. This difference is marked as expansionType. Currently, there are three expansionTypes: standard, function, and modelDescr. Function mentions and ParameterSetting mentions fall under the function type, and ModelDescription mentions fall under the modelDescr type. All other mentions are of the standard type. To see all the features of the expansion handler, follow the link here.

Attachments

Some useful information cannot be stored in the mention structure itself. This information can be turned into a ujson.Value and stored as an attachment to the mention. For instance, context information can be provided as attachments to some types of mentions. The inclusiveness or exclusiveness of the threshold in an IntervalParameterSetting mention is also provided as an attachment. See below for examples demonstrating such cases.

Context attachment example

IntervalParameterSetting attachment example
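
As a rough, hypothetical illustration of what such attachments might look like as ujson values, consider the sketch below. It assumes the ujson library (com.lihaoyi) is on the classpath; every field name and value is an assumption made for illustration and not the project's actual attachment schema.

```scala
// Assumes the ujson library (com.lihaoyi) is on the classpath.
object AttachmentSketch {
  // Hypothetical context attachment; the field names are illustrative only.
  val contextAttachment: ujson.Value = ujson.Obj(
    "type" -> "context",
    "contexts" -> ujson.Arr("in the tropics")
  )

  // Hypothetical attachment recording whether each interval bound is inclusive.
  val intervalAttachment: ujson.Value = ujson.Obj(
    "type" -> "interval",
    "inclusiveLower" -> true,
    "inclusiveUpper" -> false
  )

  // A ujson.Value can be rendered to a JSON string when needed.
  val rendered: String = ujson.write(intervalAttachment)
}
```

Keeping such details in a generic JSON value means the mention structure itself does not need a new field for every kind of extra information.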

Serialization

We serialize (save to file) and deserialize (load) mentions using the Automates JSON Serializer, which is a ujson implementation of the Processors Json Serializer, modified to account for some changes in mention structure and to handle attachments.
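
The round trip can be sketched with ujson directly. This is not the Automates JSON Serializer and does not use its API or schema: `FlatMention`, `toJson`, and `fromJson` are hypothetical, and the sketch assumes the ujson library (com.lihaoyi) is on the classpath.

```scala
// Assumes the ujson library (com.lihaoyi) is on the classpath.
object SerializationSketch {
  // Hypothetical flattened mention; not the schema used by the real serializer.
  final case class FlatMention(label: String, text: String, attachments: Seq[ujson.Value])

  // Convert a mention to a JSON value, carrying its attachments along unchanged.
  def toJson(m: FlatMention): ujson.Value = ujson.Obj(
    "label" -> m.label,
    "text" -> m.text,
    "attachments" -> ujson.Arr(m.attachments: _*)
  )

  // Rebuild a mention from a JSON value produced by toJson.
  def fromJson(json: ujson.Value): FlatMention = FlatMention(
    label = json("label").str,
    text = json("text").str,
    attachments = json("attachments").arr.toSeq
  )

  val original: FlatMention =
    FlatMention("Unit", "7-10 days", Seq(ujson.Obj("note" -> "illustrative attachment")))

  // Serialize to a string (as if saving to a file), then parse it back.
  val restored: FlatMention = fromJson(ujson.read(ujson.write(toJson(original))))
}
```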
