Selectors

adunofaiur edited this page May 31, 2014 · 4 revisions

How does BigSemantics know which wrapper to use for a specific information resource?

The solution is a structure called selectors, which helps BigSemantics automatically pick the appropriate wrapper for an input URL. The factors that BigSemantics considers in making this choice include URL patterns, MIME types, and suffixes.

One wrapper can have multiple selectors, each specifying a situation in which the wrapper should be used. If the input URL matches any of a wrapper's selectors, that wrapper will be used.

Note: ampersands in URLs will cause an error. Replace any '&' with "&_amp;" (without the '_').
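As a minimal sketch of the escaping described in the note above, the following Python helper builds a selector element with its attribute values escaped for XML. The helper name `selector_xml` and the example.com pattern are hypothetical and only serve the illustration; `quoteattr` is part of the Python standard library.

```python
from xml.sax.saxutils import quoteattr


def selector_xml(url_regex, domain):
    """Build a <selector> element string (hypothetical helper).

    quoteattr() wraps the value in quotes and escapes '&', '<' and '>',
    so a raw '&' in the pattern becomes '&amp;' in the output.
    """
    return f"<selector url_regex={quoteattr(url_regex)} domain={quoteattr(domain)} />"


# The pattern below is made up for illustration; note the raw '&'.
print(selector_xml(r"https?://www\.example\.com/search\?q=.*&type=.*", "example.com"))
```

Generating the element this way avoids having to remember the replacement by hand when a pattern contains query separators.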

Using URL Patterns

The most widely used selectors specify the URL pattern for information resources that a wrapper should handle. There are 3 types of patterns:

url_stripped
This matches against the part of the URL without the query (the part before the question mark).
url_path_tree
This matches against the beginning part of the hierarchy in a URL. This has been used in the example with UrbanSpoon.
url_regex and url_regex_fragment
These allow you to write a regular expression to match against the input URL. The former must match the whole input URL, while the latter can match a fragment of the input URL. When either is used, you must specify the domain attribute for the selector to limit matching to a specific domain, for performance reasons.
<selector url_regex_fragment="http://www.amazon.com/[^/]*/dp/[^/]*" domain="amazon.com" />
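The difference between the pattern types can be sketched in Python as follows. This is only an illustration of the descriptions above, not the BigSemantics implementation: the function names are hypothetical, and the exact rules (for example, whether url_path_tree compares whole path segments, as assumed here) are assumptions.

```python
import re
from urllib.parse import urlsplit


def matches_stripped(url, pattern):
    """url_stripped: compare the URL without its query part."""
    return url.split("?", 1)[0] == pattern


def matches_path_tree(url, prefix):
    """url_path_tree: the URL's hierarchy must start with the prefix.

    Assumption: hosts must be equal and path segments are compared whole.
    """
    u, p = urlsplit(url), urlsplit(prefix)
    if u.netloc != p.netloc:
        return False
    u_segments = [s for s in u.path.split("/") if s]
    p_segments = [s for s in p.path.split("/") if s]
    return u_segments[:len(p_segments)] == p_segments


def matches_regex(url, pattern):
    """url_regex: the expression must match the whole URL."""
    return re.fullmatch(pattern, url) is not None


def matches_regex_fragment(url, pattern):
    """url_regex_fragment: the expression may match any part of the URL."""
    return re.search(pattern, url) is not None


amazon = r"http://www.amazon.com/[^/]*/dp/[^/]*"
url = "http://www.amazon.com/Some-Book/dp/0123456789/ref=sr_1_1"
print(matches_regex_fragment(url, amazon))  # fragment match
print(matches_regex(url, amazon))           # whole-URL match
```

With the Amazon pattern above, the fragment variant accepts the URL even though it has a trailing "/ref=..." part, while the whole-URL variant rejects it.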

When url_stripped or url_regex is used, the selector can use the inner element param to further match URL query parameters. For example:

<selector url_stripped="http://www.nsf.gov/awardsearch/advancedSearchResult">
  <param name="PILastName" />
</selector>
This selector matches URLs with the specified trunk that also have a parameter named "PILastName" in their query part.
<selector url_regex="https?://www.google.com/search\?.*" domain="google.com">
  <param name="tbm" value="isch" />
</selector>
This selector matches URLs that match the specified regular expression and have a parameter "tbm", whose value is "isch", in their query part.
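The param checks in the two selectors above can be sketched as follows. The function `params_match` and its dictionary argument are hypothetical; a value of None stands for a param element that has only a name attribute.

```python
from urllib.parse import parse_qs, urlsplit


def params_match(url, required):
    """Check query parameters against a selector's <param> elements.

    required maps a parameter name to the value it must have,
    or to None when only the parameter's presence is required.
    """
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    for name, value in required.items():
        if name not in query:
            return False
        if value is not None and value not in query[name]:
            return False
    return True


# Name-only param, as in the NSF selector.
print(params_match(
    "http://www.nsf.gov/awardsearch/advancedSearchResult?PILastName=Smith",
    {"PILastName": None}))

# Name and value, as in the Google image search selector.
print(params_match("https://www.google.com/search?q=cats&tbm=isch", {"tbm": "isch"}))
```

In this sketch the parameter check runs only after the enclosing URL pattern has already matched, which mirrors the role of param as an inner element.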

Using MIME Type or Suffix

Selectors can also select by MIME types or suffixes. For example, the following selector matches common image formats:

<selector>
  <mime_type>image/jpeg</mime_type>
  <mime_type>image/png</mime_type>
  <mime_type>image/gif</mime_type>
  <mime_type>image/bmp</mime_type>
  <suffix>jpg</suffix>
  <suffix>jpeg</suffix>
  <suffix>gif</suffix>
  <suffix>png</suffix>
  <suffix>bmp</suffix>
</selector>

Selector priority

Different kinds of selectors have different priorities when used together.

URL patterns will be considered first. Out of URL patterns, url_stripped has the highest priority. If matching by url_stripped fails, url_path_tree will be considered next. If matching fails again, url_regex will be used, then url_regex_fragment. Parameters will only be considered when the enclosing URL pattern selector is activated.

If URL patterns do not match any particular wrapper, suffix will be considered.

If neither URL patterns nor suffixes match any particular wrapper, the default wrapper compound_document will be used.

When BigSemantics connects to the page and the server provides a MIME type, BigSemantics will try to match a wrapper by MIME type if the previous attempts to match a wrapper failed.
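The priority order described in this section can be sketched as a single lookup function. This is an illustration under stated assumptions, not the BigSemantics code: selectors are represented as plain dictionaries, the function and key names are hypothetical, the first matching selector of a given kind wins, and url_path_tree is reduced to a simple prefix test. Only the wrapper name compound_document comes from the text above.

```python
import re
from urllib.parse import urlsplit

# URL pattern kinds, from highest to lowest priority.
PATTERN_ORDER = ["url_stripped", "url_path_tree", "url_regex", "url_regex_fragment"]


def _matches(kind, pattern, url):
    stripped = url.split("?", 1)[0]
    if kind == "url_stripped":
        return stripped == pattern
    if kind == "url_path_tree":
        return stripped.startswith(pattern)  # simplified prefix test
    if kind == "url_regex":
        return re.fullmatch(pattern, url) is not None
    return re.search(pattern, url) is not None  # url_regex_fragment


def pick_wrapper(url, selectors, mime_type=None):
    """Return the wrapper name chosen for url.

    Each selector is a dict with a "wrapper" key plus one pattern key,
    and/or "suffixes" and "mime_types" lists (hypothetical layout).
    """
    # 1. URL patterns, in priority order.
    for kind in PATTERN_ORDER:
        for sel in selectors:
            if kind in sel and _matches(kind, sel[kind], url):
                return sel["wrapper"]
    # 2. Suffix of the last path segment.
    name = urlsplit(url).path.rsplit("/", 1)[-1]
    suffix = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    for sel in selectors:
        if suffix and suffix in sel.get("suffixes", ()):
            return sel["wrapper"]
    # 3. MIME type, available only after connecting to the page.
    if mime_type is not None:
        for sel in selectors:
            if mime_type in sel.get("mime_types", ()):
                return sel["wrapper"]
    # 4. Default wrapper.
    return "compound_document"
```

Calling `pick_wrapper` without a `mime_type` corresponds to the choice made before connecting; passing the MIME type reported by the server corresponds to the later attempt that runs only when nothing else matched.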

URL filtering

Some websites attach extra URL query parameters to their links, to facilitate session management and tracking. This usually makes it hard to compare URLs, because two URLs with different session IDs or tracking information may actually point to the same information resource.

BigSemantics allows you to use location filters for a wrapper to filter out those query parameters. This happens after an input URL is matched with the wrapper, but before any other processing.

Location filters support the following actions:

  • set_param: Adding or setting a query parameter with a specific name, and optionally a specific value.
  • strip_param: Removing unwanted query parameters by name.
  • regex: Matching and replacing arbitrary strings in the URL. This is the most flexible action.
  • alternative_host: Specifying that the same information resource can be addressed by alternative hosts.
For example, here is the location filter for the ACM Digital Library:
<filter_location>
  <strip_param name="coll" />
  <strip_param name="dl" />
  <strip_param name="CFID" />
  <strip_param name="CFTOKEN" />
  <set_param name="preflayout" value="flat" />
  <regex match="id=\d+\.(\d+)" replace="id=$1" />
  <alternative_host>portal.acm.org</alternative_host>
  <alternative_host>dl.acm.org</alternative_host>
</filter_location>

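Consider an input URL http://portal.acm.org/citation.cfm?id=1871437.1871580&coll=DL&dl=ACM&CFID=178910658&CFTOKEN=90231150 that has been matched with this wrapper and is now being filtered. The strip_param actions remove the coll, dl, CFID, and CFTOKEN parameters; set_param sets preflayout to flat; the regex rewrites id=1871437.1871580 to id=1871580; and the alternative_host entries declare that portal.acm.org and dl.acm.org address the same resource.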

Note that the order of operations is not guaranteed, so the operations must be independent of each other.
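The effect of the ACM filter on the example URL can be sketched in Python. This is a rough model under assumptions, not the BigSemantics implementation: the function `filter_location` and its arguments are hypothetical, alternative_host is not modeled, and the order of the resulting query parameters is simply what this sketch produces. The filter's replacement "$1" is written as "\1" here because that is Python's syntax for a group reference.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def filter_location(url, strip=(), set_params=None, regexes=()):
    """Apply regex, strip_param, and set_param style actions to a URL."""
    # regex actions: match and replace on the raw URL string.
    for pattern, replacement in regexes:
        url = re.sub(pattern, replacement, url)
    parts = urlsplit(url)
    # strip_param actions: drop unwanted query parameters by name.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in strip]
    # set_param actions: add the parameter, replacing any existing value.
    for name, value in (set_params or {}).items():
        query = [(k, v) for k, v in query if k != name]
        query.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


url = ("http://portal.acm.org/citation.cfm?id=1871437.1871580"
       "&coll=DL&dl=ACM&CFID=178910658&CFTOKEN=90231150")
print(filter_location(
    url,
    strip={"coll", "dl", "CFID", "CFTOKEN"},
    set_params={"preflayout": "flat"},
    regexes=[(r"id=\d+\.(\d+)", r"id=\1")],
))
# prints http://portal.acm.org/citation.cfm?id=1871580&preflayout=flat
```

The three kinds of actions in this example touch different parameters, so applying them in another order gives the same result, which is the independence the note above asks for.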