Extraction Operations

adunofaiur edited this page May 31, 2014 · 1 revision

Meta-metadata wrappers support a variety of operations that ease data extraction and post-extraction manipulation. Some of these are known as semantic actions, which support loops and if statements; see the semantic actions page for details. Everything else is cataloged here.

This page is a work in progress. Information on this page may be incomplete or wildly inaccurate.

Field Parser

Field parsers enable advanced manipulation of fields with regular expressions. In particular, they can be used to split a single string into multiple fields.

<composite name="artist_info" type="getty_artist_info">
  <xpath>//p[@class='bio'][1]</xpath>
  <field_parser name="regex_find" regex="(.*)(\n.*)(\n.*)(\n.*)(\n.*)" />
  <scalar name="lifespan" field_parser_key="$1" />
  <scalar name="professions" field_parser_key="$3" />
  <scalar name="languages" field_parser_key="$5" />
</composite>
In the above example, the field parser finds five values in the string at the provided xpath (each value is delineated by a pair of parentheses, i.e. a capture group in the regex). Each scalar has a field_parser_key that indicates which value it should receive: $1 is the first group, $3 the third, and $5 the fifth.
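
The capture-group behaviour described above can be imitated outside the wrapper language. The following is only a minimal Python sketch, not the actual field parser implementation: the function name `regex_find`, the sample text, and the whitespace stripping are assumptions made for illustration, as is the convention that `$1` refers to the first capture group.

```python
import re

# Hypothetical sample text; in a real wrapper the value comes from the xpath.
bio = "1850-1920\nline two\npainter, sculptor\nline four\nFrench, Italian"

# Same regex as in the wrapper example: five capture groups, one per line.
FIND_REGEX = r"(.*)(\n.*)(\n.*)(\n.*)(\n.*)"

def regex_find(text, regex):
    """Return {"$1": ..., "$2": ...} for the first match, or {} if none."""
    match = re.search(regex, text)
    if match is None:
        return {}
    # Groups 2-5 start with the newline they matched; stripping it is an
    # assumption of this sketch, not documented behaviour.
    return {f"${i}": g.strip() for i, g in enumerate(match.groups(), start=1)}

values = regex_find(bio, FIND_REGEX)

# Each scalar picks one value via its field_parser_key.
artist_info = {
    "lifespan": values["$1"],
    "professions": values["$3"],
    "languages": values["$5"],
}
print(artist_info)
```

Groups `$2` and `$4` are matched but simply not assigned to any scalar, which mirrors how the wrapper skips lines it does not need.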

A field parser can also be used with the regex_split option, which splits the string on the regex and returns two fields ($0 and $1).

<collection name="authors" label="authors">
  <xpath>ul/li[@class='author']</xpath>
  <field_parser name="regex_split" regex=",\s" trim="true" />
  <scalar name="title" field_parser_key="$0" />
</collection>
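
As a rough illustration of splitting, here is a minimal Python sketch. It is not the wrapper engine's code: the function name `regex_split`, the sample string, and the zero-based `$0`, `$1` key numbering are assumptions based on the example above, and `trim` is modelled as plain whitespace stripping.

```python
import re

def regex_split(text, regex, trim=True):
    """Split text on regex and key the pieces as "$0", "$1", ..."""
    pieces = re.split(regex, text)
    if trim:
        # Assumed meaning of trim="true": strip surrounding whitespace.
        pieces = [p.strip() for p in pieces]
    return {f"${i}": p for i, p in enumerate(pieces)}

# Hypothetical list item containing two comma-separated names.
fields = regex_split("Author One, Author Two", r",\s")
print(fields["$0"])
print(fields["$1"])
```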

Filters

filter_location performs operations on the current document's location (typically a URL). It supports the following operations:

  • strip_param - removes the parameter and its value from the document's location
  • set_param - sets the value of a parameter
  • regex - matches and replaces parts of the location
  • alternative_host - names an alternative host for the location (its exact effect is not yet documented here)
  <filter_location>
    <set_param name="preflayout" value="flat" />
    <strip_param name="coll" />
    <strip_param name="dl" />
    <strip_param name="CFID" />
    <strip_param name="CFTOKEN" />
    <regex match="id=\d+\.(\d+)" replace="id=$1" />
    <alternative_host>portal.acm.org</alternative_host>
    <alternative_host>dl.acm.org</alternative_host>
  </filter_location>
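
To make the effect of these operations concrete, the following Python sketch applies strip_param, set_param, and regex to a URL. It is an illustration under stated assumptions, not the engine's implementation: the function `filter_location`, its argument names, the order in which operations run, and the sample URL are all hypothetical, and alternative_host is left out because its behaviour is not described here. Note that Python writes the back-reference as `\1` where the wrapper uses `$1`.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def filter_location(url, set_params=None, strip_params=(), regex_rules=()):
    """Apply strip_param, set_param and regex style operations to a URL."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)

    # strip_param: drop the parameter together with its value.
    query = [(k, v) for k, v in query if k not in strip_params]

    # set_param: overwrite an existing parameter or append a new one.
    for name, value in (set_params or {}).items():
        query = [(k, v) for k, v in query if k != name]
        query.append((name, value))

    url = urlunsplit(parts._replace(query=urlencode(query)))

    # regex: match and replace parts of the resulting location.
    for pattern, replacement in regex_rules:
        url = re.sub(pattern, replacement, url)
    return url

# Hypothetical location, shaped like the parameters in the example above.
original = ("http://portal.acm.org/citation.cfm"
            "?id=1416955.1416963&coll=DL&dl=ACM&CFID=123&CFTOKEN=456")

filtered = filter_location(
    original,
    set_params={"preflayout": "flat"},
    strip_params=("coll", "dl", "CFID", "CFTOKEN"),
    regex_rules=[(r"id=\d+\.(\d+)", r"id=\1")],
)
print(filtered)
```

Session parameters such as CFID and CFTOKEN differ between visits, so removing them lets two visits to the same page resolve to the same location.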

Concatenate Values

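The concatenate_values tag lets you concatenate the data from multiple fields into a single field. In the example below, three scalars are each derived from the same link text with a regex_op, and the location scalar joins them in order to form a search URL.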

    <scalar name="pre_description">
      <xpath>./a/text()</xpath>
      <regex_op regex="^.*$"
        replace="http://www.fondation-langlois.org/html/e/research.php?Filtres=1&amp;MotsCles=" />
    </scalar>
    <scalar name="mid_description">
      <xpath>./a/text()</xpath>
      <regex_op regex="\s" replace="+" />
    </scalar>
    <scalar name="post_description">
      <xpath>./a/text()</xpath>
      <regex_op regex=".*" replace="&amp;Numero=&amp;zoom=1&amp;Format=1" />
    </scalar>


    <scalar name="location">
      <concatenate_values>
        <value from_scalar="pre_description" />
        <value from_scalar="mid_description" />
        <value from_scalar="post_description" />
      </concatenate_values>
    </scalar>
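
The same pipeline can be traced step by step in Python. This is a hedged sketch rather than the engine's behaviour: the helper `regex_op`, the sample link text, and the assumption that regex_op replaces every match are illustrative. The third pattern is written as `^.*$` instead of `.*` because, in Python 3.7 and later, an unanchored `.*` passed to `re.sub` also matches the empty string at the end of the input and would insert the replacement twice.

```python
import re

def regex_op(text, regex, replace):
    """Replace every match of regex in text (assumed regex_op semantics)."""
    return re.sub(regex, replace, text)

# Hypothetical link text extracted by ./a/text()
link_text = "new media art"

# pre_description: the whole text is replaced by the URL prefix.
pre = regex_op(
    link_text, r"^.*$",
    "http://www.fondation-langlois.org/html/e/research.php?Filtres=1&MotsCles=",
)
# mid_description: whitespace becomes "+", giving a query-safe keyword string.
mid = regex_op(link_text, r"\s", "+")
# post_description: the whole text is replaced by the URL suffix.
post = regex_op(link_text, r"^.*$", "&Numero=&zoom=1&Format=1")

# concatenate_values: join the three scalars in the order they are listed.
location = pre + mid + post
print(location)
```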

Other

url_generator

<url_generator type="search" engine="acm_portal" use_id="title" />