Tutorial on advanced wrapper features
This tutorial covers advanced language features provided by meta-metadata, including regular-expression-based filtering and extraction, nested meta-metadata, and generic fields.
If you are not familiar with regular expressions, please refer to this tutorial.
Extracted information is often not in an immediately usable form, and some post-processing is needed to normalize it into a uniform representation that the application can use. Regular expressions are a powerful tool for this task.
Meta-metadata supports filtering extracted information with a regular expression, in the form of an extra filter element inside a scalar field:
<scalar name="author_name" xpath="...">
<filter regex="[A-Z][a-z]+, [A-Z]\." />
</scalar>
In the above example, the regular expression will be used to extract a name in the form of "Jobs, S." from the extracted information.
It is also possible to replace the matched part with another string. In the following example, the leading "ISBN: " of the extracted information is replaced by an empty string (and thus removed):
<scalar name="isbn" xpath="...">
<filter regex="ISBN:\s+" replace="" />
</scalar>
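The effect of the two filters above can be sketched in Python. This is only an illustration of the regular expression behavior, not the BigSemantics implementation: the function name apply_filter and the sample input strings are hypothetical, and the sketch assumes that a filter without replace keeps the first match, that a filter with replace substitutes every match, and that a value with no match is left unchanged.

```python
import re

def apply_filter(value, regex, replace=None):
    """Hypothetical stand-in for a filter element on a scalar field."""
    if replace is not None:
        # Assumption: every occurrence of the pattern is replaced.
        return re.sub(regex, replace, value)
    match = re.search(regex, value)
    # Assumption: a value that does not match is returned unchanged.
    return match.group(0) if match else value

# Sample inputs are made up for illustration.
author = apply_filter("by Jobs, S. (2010)", r"[A-Z][a-z]+, [A-Z]\.")
isbn = apply_filter("ISBN: 0123456789", r"ISBN:\s+", replace="")
print(author)
print(isbn)
```

With these sample inputs, author keeps only the "Jobs, S." part and isbn loses its "ISBN: " prefix.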
Meta-metadata allows extracting information from a flat string into a composite or a collection element, using regular expressions or other means, through a field_parser element.
The two basic field_parsers are regex_find and regex_split. The former applies to a composite: it takes the string for that composite (evaluated from XPath or the direct binding tag name) as input, matches it against the specified regular expression, and outputs indexed capture groups. Subfields nested inside the composite can specify field_parser_key in the form of $1, $2, and so on, to assign a capture group to that field. For example:
<composite name="citation_info" xpath="//h1">
<field_parser name="regex_find" regex="(\d+) citations .* (\d+) self" />
<scalar name="total_citation" field_parser_key="$1" />
<scalar name="self_citation" field_parser_key="$2" />
</composite>
In this example, the field parser uses XPath "//h1" to retrieve the citation information string (e.g. "40 citations -- 2 self") from CiteSeerX, matches it against the regular expression to capture the 2 numbers, and assigns them to the 2 nested scalar fields.
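The capture-group assignment can be sketched in Python as follows. The function regex_find here is a hypothetical stand-in that only mirrors the behavior described above; it is not the runtime's parser, and returning an empty result when nothing matches is an assumption.

```python
import re

def regex_find(text, regex, field_keys):
    """Map capture groups to field names, following keys such as "$1" and "$2"."""
    match = re.search(regex, text)
    if match is None:
        return {}  # assumption: no match yields no field values
    return {
        field: match.group(int(key.lstrip("$")))
        for field, key in field_keys.items()
    }

citation_info = regex_find(
    "40 citations -- 2 self",
    r"(\d+) citations .* (\d+) self",
    {"total_citation": "$1", "self_citation": "$2"},
)
print(citation_info)
```

Note that the captured values are strings; converting them to numbers would be up to the scalar type of each field.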
regex_split uses the regular expression as a delimiter to split the input string into a set of values, and assigns each value to an element in a collection (if the collection has more than one nested field, the field with field_parser_key set to $0 receives the value). For instance, the following example separates a list of author names, delimited by commas with optional surrounding spaces, into a collection:
<collection name="authors" child_type="author">
<xpath>...</xpath>
<field_parser name="regex_split" regex="\s*,\s*" />
<scalar name="author_name" field_parser_key="$0" />
</collection>
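A minimal Python sketch of the splitting step is shown below. The function regex_split and the author names are hypothetical; the sketch assumes each split value becomes one collection element whose $0 field holds the value, and that empty pieces are dropped.

```python
import re

def regex_split(text, regex, field_name):
    """Split on the delimiter and build one element per non-empty value."""
    parts = re.split(regex, text.strip())
    # Assumption: empty pieces (e.g. from a trailing delimiter) are skipped.
    return [{field_name: part} for part in parts if part]

# Made-up author list with uneven spacing around the commas.
authors = regex_split("Alice Smith , Bob Jones, Carol Wu", r"\s*,\s*", "author_name")
for author in authors:
    print(author["author_name"])
```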
The field_parser mechanism is extensible: derive a subclass from FieldParser and register it by name with FieldParserFactory. For example, bibtex handles input strings in BibTeX format, and regex_split_and_find combines the functionality of regex_split and regex_find. We encourage you to experiment with them, or to write your own parser when needed.
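The register-by-name pattern can be illustrated with a small Python sketch. Everything in it is hypothetical: the actual FieldParser and FieldParserFactory are classes in the BigSemantics code base whose signatures are not shown in this tutorial, and the key_value parser below is invented purely to show how a custom parser could be looked up by the name used in a field_parser element.

```python
import re

FIELD_PARSERS = {}  # hypothetical registry: parser name -> parser class

def register(name):
    """Class decorator that records a parser class under the given name."""
    def decorator(cls):
        FIELD_PARSERS[name] = cls
        return cls
    return decorator

class FieldParser:
    """Hypothetical base class; subclasses turn a flat string into values."""
    def parse(self, text):
        raise NotImplementedError

@register("key_value")
class KeyValueParser(FieldParser):
    """Invented example parser: turns 'k1 = v1; k2 = v2' into a dict."""
    def parse(self, text):
        items = re.split(r"\s*;\s*", text.strip())
        pairs = (item.split("=", 1) for item in items if "=" in item)
        return {key.strip(): value.strip() for key, value in pairs}

# Look the parser up by name, as a factory would.
parser = FIELD_PARSERS["key_value"]()
parsed = parser.parse("publisher = ACM; year = 2011")
print(parsed)
```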
As with nested class definitions in Java or other OOP languages, you can extend a new meta-metadata type from an existing one right where you need it, typically on a composite or collection field. The nested meta-metadata type is visible only in the enclosing meta-metadata and its subtypes, or inside the field itself. For example:
<meta_metadata name="search" extends="compound_document">
<collection name="search_results" child_type="search_result" />
</meta_metadata>
<meta_metadata name="delicious_search" extends="search">
<collection name="search_results" child_type="delicious_search_result" child_extends="search_result" >
<scalar name="author" scalar_type="String" />
<collection name="tags" child_scalar_type="String" />
</collection>
</meta_metadata>
In the above example, the new type delicious_search_result is visible only to delicious_search and its subtypes, or inside the field search_results itself.
When a composite (or collection) field is inherited in a sub meta-metadata type, it is possible to change its type (or child_type) to a more specific type. In other words, one can define the base field using a generic type and specify a concrete sub-type for that field in the derived meta-metadata. Let's look at an example:
<meta_metadata name="search" extends="compound_document">
<collection name="search_results" child_type="search_result" />
</meta_metadata>
<meta_metadata name="social_search" extends="search">
<collection name="search_results" child_type="social_search_result" />
</meta_metadata>
In the above example we assume that there is an independent type social_search_result which extends ordinary search_result with Social Network Service specific fields. social_search, which extends search, can just use social_search_result instead of search_result for the same field, to specify the concrete type used for this field in this context.
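The constraint that the overriding type must be more specific than the inherited one can be sketched in Python. This is an illustration only: the type table, the parent assigned to search_result, and the function names are hypothetical, and whether the runtime performs such a check at load time is an assumption.

```python
# Hypothetical type table: each type name maps to the type it extends.
PARENTS = {
    "search_result": "document",          # assumed parent, for illustration
    "social_search_result": "search_result",
    "image": "document",                   # unrelated type, for contrast
}

def is_subtype(candidate, base, parents=PARENTS):
    """Walk up the extends chain from candidate, looking for base."""
    current = candidate
    while current is not None:
        if current == base:
            return True
        current = parents.get(current)
    return False

def narrow_child_type(base_child_type, new_child_type):
    """Accept the override only if it is the same type or a subtype."""
    if not is_subtype(new_child_type, base_child_type):
        raise ValueError(
            f"{new_child_type} does not extend {base_child_type}"
        )
    return new_child_type

narrowed = narrow_child_type("search_result", "social_search_result")
print(narrowed)
```

In this sketch, replacing search_result with image would be rejected, because image does not extend search_result.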
Note that more fundamental support for generic fields is under active development. The syntax may change in the near future.
This is not a complete list of the advanced features meta-metadata supports for modeling, extracting, and using complex metadata semantics. Building a system that supports real-world problems has involved many unexpected challenges and subtleties. For more details, please see our documentation about the meta-metadata and S.IM.PL system.
The Reselect mechanism is a semantic action that allows the BigSemantics runtime to identify the type of a webpage using not just its web address but also content in the page itself. A preliminary wrapper, selected by URL, extracts a value (typically a String) from the page that determines the page's actual type. Reselect then uses this extracted field value to select another wrapper, which extracts further information from the same page. This is necessary when different types of pages share the same URL pattern, so that typical URL-based selectors cannot distinguish them.
The reselected wrapper reuses the page's DOM tree. The metadata object it extracts is added to the initial one as a mixin.
For example, for books on Amazon, we may want to extract not only general Amazon product information such as price and reviews, but also book-specific information such as authors, publisher, and ISBN. However, all Amazon products share the same URL pattern, so we use the reselect mechanism.
To use the reselect mechanism:
1. Add a <reselect_meta_metadata_and_extract> semantic action in the initial wrapper (<amazon_product> in our case). The initial wrapper should use a proper selector to match with pages of the general type (general Amazon product pages here):
<meta_metadata name="amazon_product" extends="product" parser="xpath">
<selector url_regex_fragment="http://www.amazon.com/[^/]*/dp/[^/]*" domain="amazon.com" />
<!-- general Amazon product fields -->
<semantic_actions>
<!-- optionally, use "name" to return the extracted metadata object for use in semantic actions -->
<reselect_meta_metadata_and_extract name="amazon_item" />
<!-- other semantic actions -->
</semantic_actions>
</meta_metadata>
2. Add a special selector in each of the wrappers that can be reselected for extracting extra fields. The selector needs to specify 1) the initial wrapper name, 2) a field name, which should be part of the initial wrapper, and 3) an expected value:
<meta_metadata name="amazon_book" extends="book" parser="xpath">
<selector meta_metadata_name="amazon_product">
<field name="department" value="Books" />
</selector>
<!-- book specific fields here -->
</meta_metadata>
When the initial wrapper is used to obtain an initial metadata object, the runtime will match the real value for that field with the expected value to reselect a proper wrapper.
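The two-stage selection described above can be sketched in Python. This is a simplified model, not the BigSemantics runtime: the WRAPPERS table, the function names, the sample URL, and the stubbed extraction result are all hypothetical. Only the wrapper names, the URL pattern, and the department/Books selector come from the example.

```python
import re

# Hypothetical in-memory view of the two wrappers from the example.
WRAPPERS = {
    "amazon_product": {
        "url_regex": r"http://www.amazon.com/[^/]*/dp/[^/]*",
    },
    "amazon_book": {
        "reselect_from": "amazon_product",
        "field": "department",
        "value": "Books",
    },
}

def select_by_url(url):
    """First stage: pick the initial wrapper from the URL alone."""
    for name, wrapper in WRAPPERS.items():
        pattern = wrapper.get("url_regex")
        if pattern and re.match(pattern, url):
            return name
    return None

def reselect(initial_name, metadata):
    """Second stage: pick a wrapper whose selector matches an extracted field."""
    for name, wrapper in WRAPPERS.items():
        if (wrapper.get("reselect_from") == initial_name
                and metadata.get(wrapper["field"]) == wrapper["value"]):
            return name
    return None

# Made-up URL; the metadata dict stands in for the initial extraction result.
initial = select_by_url("http://www.amazon.com/Some-Title/dp/0000000000")
metadata = {"department": "Books"}
reselected = reselect(initial, metadata)
print(initial, reselected)
```

In the real system the reselected wrapper would then run against the same DOM tree and its result would be attached to the initial metadata object as a mixin; that step is omitted here.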