Extraction Operations
Meta-metadata wrappers support a variety of operations that ease data extraction and post-extraction manipulation. Some of these are known as Semantics Actions, which support loops and if statements. See semantics actions for details. Everything else is cataloged here.
This page is a work in progress. Information on this page may be incomplete or wildly inaccurate.
Field parsers enable advanced manipulation of fields with regular expressions. In particular, they can be used to split a single string into multiple fields.
<composite name="artist_info" type="getty_artist_info">
<xpath>//p[@class='bio'][1]</xpath>
<field_parser name="regex_find" regex="(.*)(\n.*)(\n.*)(\n.*)(\n.*)" />
<scalar name="lifespan" field_parser_key="$1" />
<scalar name="professions" field_parser_key="$3" />
<scalar name="languages" field_parser_key="$5" />
</composite>
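As a rough illustration of what the regex_find example above does, the following Python sketch applies the same pattern to a made-up five-line string and maps each capture group to a $n key. The function name regex_find, the sample text, and the use of Python are assumptions for illustration only; this is not the wrapper engine's implementation. Note that groups 2 to 5 capture the leading newline, and this page does not say whether the engine trims it.

```python
import re

# Same pattern as in the regex_find example above.
PATTERN = re.compile(r"(.*)(\n.*)(\n.*)(\n.*)(\n.*)")

# Made-up sample text standing in for the content of the matched <p> element.
SAMPLE = "1900-1980\nsecond line\npainter, printmaker\nfourth line\nFrench, Italian"


def regex_find(text, pattern=PATTERN):
    """Return a dict mapping "$n" keys to capture groups, or {} if no match."""
    match = pattern.search(text)
    if match is None:
        return {}
    return {f"${i}": match.group(i) for i in range(1, pattern.groups + 1)}


if __name__ == "__main__":
    fields = regex_find(SAMPLE)
    # The wrapper above reads keys $1, $3 and $5.
    for key in ("$1", "$3", "$5"):
        print(key, repr(fields[key]))
```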
A field parser can also be used with the regex_split option, which returns two fields ($0 and $1).
<collection name="authors" label="authors">
<xpath>ul/li[@class='author']</xpath>
<field_parser name="regex_split" regex=",\s" trim="true" />
<scalar name="title" field_parser_key="$0" />
</collection>
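The regex_split behavior described above can be approximated in Python as follows. This is a minimal sketch under assumptions: the function name, the zero-based $n numbering of the pieces, and the interpretation of trim="true" as stripping surrounding whitespace from each piece are inferred from the example, not confirmed by this page. The sample string is made up.

```python
import re


def regex_split(text, regex, trim=False):
    """Split text on regex and return the pieces keyed as "$0", "$1", ..."""
    parts = re.split(regex, text)
    if trim:
        # Assumed meaning of trim="true": strip whitespace around each piece.
        parts = [part.strip() for part in parts]
    return {f"${i}": part for i, part in enumerate(parts)}


if __name__ == "__main__":
    # Hypothetical text of one <li class="author"> element.
    print(regex_split(" First Piece, Second Piece ", r",\s", trim=True))
```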
Filter location performs operations on the current document's location (typically its URL). It supports the following operations:
- strip_param - Removes a parameter and its value from the document's location.
- set_param - Sets the value of a parameter.
- regex - Matches and replaces parts of the location.
- alternative_host - Names an alternative host for the location; its exact effect is not yet documented here.
<filter_location>
<set_param name="preflayout" value="flat" />
<strip_param name="coll" />
<strip_param name="dl" />
<strip_param name="CFID" />
<strip_param name="CFTOKEN" />
<regex match="id=\d+\.(\d+)" replace="id=$1" />
<alternative_host>portal.acm.org</alternative_host>
<alternative_host>dl.acm.org</alternative_host>
</filter_location>
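To make the effect of the filter_location example above more concrete, here is a Python sketch of strip_param, set_param, and regex applied to a made-up URL. Several things are assumptions: the function name, the order in which the operations run, and the sample URL. The alternative_host operation is left out because its behavior is not described here. Python writes the back-reference as \1 where the wrapper writes $1.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def filter_location(url, set_params=None, strip_params=(), regex=None):
    """Sketch of strip_param, set_param and regex applied to a location."""
    parts = urlsplit(url)
    # strip_param: drop the named parameters together with their values.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in strip_params]
    # set_param: set (or overwrite) the value of a parameter.
    for name, value in (set_params or {}).items():
        query = [(k, v) for k, v in query if k != name]
        query.append((name, value))
    result = urlunsplit(parts._replace(query=urlencode(query)))
    # regex: match and replace part of the resulting location.
    if regex is not None:
        match, replace = regex
        result = re.sub(match, replace, result)
    return result


if __name__ == "__main__":
    # Hypothetical location; only the parameter names come from the example.
    url = "http://portal.acm.org/citation.cfm?id=1234.5678&coll=DL&dl=ACM&CFID=1&CFTOKEN=2"
    print(filter_location(
        url,
        set_params={"preflayout": "flat"},
        strip_params=("coll", "dl", "CFID", "CFTOKEN"),
        regex=(r"id=\d+\.(\d+)", r"id=\1"),
    ))
```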
The concatenate_values tag lets you concatenate the values of multiple fields into a single field.
<scalar name="pre_description">
<xpath>./a/text()</xpath>
<regex_op regex="^.*$"
replace="http://www.fondation-langlois.org/html/e/research.php?Filtres=1&amp;MotsCles=" />
</scalar>
<scalar name="mid_description">
<xpath>./a/text()</xpath>
<regex_op regex="\s" replace="+" />
</scalar>
<scalar name="post_description">
<xpath>./a/text()</xpath>
<regex_op regex=".*" replace="&amp;Numero=&amp;zoom=1&amp;Format=1" />
</scalar>
<scalar name="location">
<concatenate_values>
<value from_scalar="pre_description" />
<value from_scalar="mid_description" />
<value from_scalar="post_description" />
</concatenate_values>
</scalar>
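The example above builds a search URL from three intermediate scalars. A minimal Python sketch of the concatenation step follows, assuming the values are joined in the order they are listed and without a separator (the example suggests this, but the page does not state it). The function name and the sample link text "video art" are hypothetical; the three scalar values are what the regex_op rules in the example would be expected to yield for that text.

```python
def concatenate_values(scalars, from_scalars):
    """Join the named scalar values in the listed order, with no separator."""
    return "".join(scalars[name] for name in from_scalars)


# Scalar values as they might look for the hypothetical link text "video art".
scalars = {
    "pre_description": "http://www.fondation-langlois.org/html/e/research.php?Filtres=1&MotsCles=",
    "mid_description": "video+art",
    "post_description": "&Numero=&zoom=1&Format=1",
}

location = concatenate_values(
    scalars, ["pre_description", "mid_description", "post_description"]
)

if __name__ == "__main__":
    print(location)
```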
url_generator
<url_generator type="search" engine="acm_portal" use_id="title" />