Selectors
How does BigSemantics know which wrapper to use for a specific information resource?
The answer is a structure called selectors, which lets BigSemantics automatically pick the appropriate wrapper for an input URL. The factors that BigSemantics considers when making this choice include URL patterns, MIME types, and suffixes.
One wrapper can have multiple selectors, each describing a situation in which the wrapper should be used. If the input URL matches any of a wrapper's selectors, that wrapper will be used.
Note: unescaped ampersands in URLs will cause an error. Replace each '&' with the XML entity "&amp;".
The most widely used selectors specify the URL pattern for information resources that a wrapper should handle. There are 3 types of patterns:
- url_stripped
- This matches against the part of the URL without the query (the part before the question mark).
- url_path_tree
- This matches against the beginning part of the hierarchy in a URL. This has been used in the example with UrbanSpoon.
- url_regex and url_regex_fragment
- These allow you to write a regular expression to match against the input URL. The former must match the whole input URL, while the latter can match a fragment of it. When either is used, you must also specify the domain attribute on the selector. This restricts matching to a specific domain, which keeps regular expression matching fast.
<selector url_regex_fragment="http://www.amazon.com/[^/]*/dp/[^/]*" domain="amazon.com" />
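The difference between whole-URL and fragment matching can be illustrated with a short Python sketch. This is not BigSemantics code; it only assumes, as described above, that url_regex behaves like a whole-string match and url_regex_fragment like a search. The helper names and the two sample URLs are hypothetical.

```python
import re

# Pattern copied from the selector above.
PATTERN = r"http://www.amazon.com/[^/]*/dp/[^/]*"

def matches_whole(pattern: str, url: str) -> bool:
    """url_regex-style matching: the pattern must cover the entire URL."""
    return re.fullmatch(pattern, url) is not None

def matches_fragment(pattern: str, url: str) -> bool:
    """url_regex_fragment-style matching: the pattern may match any part of the URL."""
    return re.search(pattern, url) is not None

# Hypothetical URLs, for illustration only.
short_url = "http://www.amazon.com/Some-Book/dp/0123456789"
long_url = "http://www.amazon.com/Some-Book/dp/0123456789/ref=sr_1_1?ie=UTF8"

print(matches_whole(PATTERN, short_url), matches_whole(PATTERN, long_url))
print(matches_fragment(PATTERN, short_url), matches_fragment(PATTERN, long_url))
```

In this sketch the longer URL fails the whole-URL match because of its trailing path and query, but it still matches as a fragment, which is why the fragment form suits URLs that carry extra trailing segments.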
When url_stripped or url_regex is used, the selector can contain inner param elements to further match URL query parameters. For example:
<selector url_stripped="http://www.nsf.gov/awardsearch/advancedSearchResult">
<param name="PILastName" />
</selector>
<selector url_regex="https?://www.google.com/search\?.*" domain="google.com">
<param name="tbm" value="isch" />
</selector>
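A rough Python sketch of how a url_stripped selector with param elements could be evaluated is shown below. It is an illustration, not the BigSemantics implementation. It assumes that a param without a value attribute only requires the parameter to be present, and that a param with a value requires that exact value. The function names are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

def strip_query(url: str) -> str:
    """Return the URL without its query string and fragment."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{parts.path}"

def params_match(url: str, required: dict) -> bool:
    """Check required query parameters; a value of None means 'any value'."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    for name, value in required.items():
        if name not in query:
            return False
        if value is not None and value not in query[name]:
            return False
    return True

def stripped_selector_matches(url: str, url_stripped: str, required: dict) -> bool:
    """The stripped URL must be equal, and every required parameter must match."""
    return strip_query(url) == url_stripped and params_match(url, required)

# Mirrors the first selector above: PILastName must be present, with any value.
nsf = "http://www.nsf.gov/awardsearch/advancedSearchResult"
print(stripped_selector_matches(nsf + "?PILastName=Smith", nsf, {"PILastName": None}))
```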
Selectors can also select by MIME types or suffixes. For example, the following selector matches common image formats:
<selector>
<mime_type>image/jpeg</mime_type>
<mime_type>image/png</mime_type>
<mime_type>image/gif</mime_type>
<mime_type>image/bmp</mime_type>
<suffix>jpg</suffix>
<suffix>jpeg</suffix>
<suffix>gif</suffix>
<suffix>png</suffix>
<suffix>bmp</suffix>
</selector>
Different kinds of selectors have different priorities when used together.
URL patterns are considered first. Among URL patterns, url_stripped has the highest priority. If matching by url_stripped fails, url_path_tree is considered next. If that fails too, url_regex is used, then url_regex_fragment. Parameters are only considered when the enclosing URL pattern selector matches.
If URL patterns do not match any particular wrapper, suffixes are considered.
If neither URL patterns nor suffixes match any particular wrapper, the default wrapper compound_document will be used.
When BigSemantics connects to the page and the server provides a MIME type, BigSemantics will try to match a wrapper by MIME type if the previous attempts failed.
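The priority order above can be sketched in Python as follows. This is a simplified illustration, not the actual BigSemantics code. It assumes that url_path_tree behaves like a prefix match, ignores param elements and the domain attribute, and treats the MIME type as a last fallback before the default wrapper. The function name, the tuple layout, and the wrapper names in the usage example are hypothetical.

```python
import re
from urllib.parse import urlsplit

DEFAULT_WRAPPER = "compound_document"

# URL matchers listed in the priority order described above.
URL_MATCHERS = [
    ("url_stripped", lambda pat, url: url.split("?", 1)[0] == pat),
    ("url_path_tree", lambda pat, url: url.startswith(pat)),  # assumed prefix match
    ("url_regex", lambda pat, url: re.fullmatch(pat, url) is not None),
    ("url_regex_fragment", lambda pat, url: re.search(pat, url) is not None),
]

def pick_wrapper(url, selectors, suffixes=None, mime_types=None, mime_type=None):
    """selectors: list of (kind, pattern, wrapper_name) tuples.
    suffixes / mime_types: dicts mapping a suffix or MIME type to a wrapper name."""
    # 1. URL patterns, one kind at a time, highest priority first.
    for kind, matcher in URL_MATCHERS:
        for sel_kind, pattern, wrapper in selectors:
            if sel_kind == kind and matcher(pattern, url):
                return wrapper
    # 2. Suffix of the URL path.
    path = urlsplit(url).path
    suffix = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    if suffixes and suffix in suffixes:
        return suffixes[suffix]
    # 3. MIME type reported by the server, if any.
    if mime_types and mime_type in mime_types:
        return mime_types[mime_type]
    # 4. Nothing matched: fall back to the default wrapper.
    return DEFAULT_WRAPPER

selectors = [
    ("url_regex_fragment", r"example\.com/item", "item_by_regex"),
    ("url_stripped", "http://example.com/item", "item_by_stripped"),
]
print(pick_wrapper("http://example.com/item?id=3", selectors))
```

Both selectors in the usage example match the URL, but the url_stripped one wins because its kind is tried first, regardless of the order in which the selectors are listed.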
Some websites attach extra URL query parameters to their links to support session management and tracking. This usually makes it hard to compare URLs, because two URLs with different session IDs or tracking information may actually point to the same information resource.
BigSemantics allows you to use location filters in a wrapper to filter out those query parameters. Filtering happens after an input URL is matched with the wrapper, but before any other processing.
Location filters support the following actions:
- set_param: Adding or setting a query parameter with a specific name, and optionally a specific value.
- strip_param: Removing unwanted query parameters by name.
- regex: Matching and replacing arbitrary strings in the URL using a regular expression. This is the most flexible action.
- alternative_host: Specifying that the same information resource can be addressed by alternative hosts.
<filter_location>
<strip_param name="coll" />
<strip_param name="dl" />
<strip_param name="CFID" />
<strip_param name="CFTOKEN" />
<set_param name="preflayout" value="flat" />
<regex match="id=\d+\.(\d+)" replace="id=$1" />
<alternative_host>portal.acm.org</alternative_host>
<alternative_host>dl.acm.org</alternative_host>
</filter_location>
Suppose the input URL http://portal.acm.org/citation.cfm?id=1871437.1871580&coll=DL&dl=ACM&CFID=178910658&CFTOKEN=90231150 has been matched with this wrapper and is now being filtered:
- First, parameters such as "coll", "dl", "CFID", and "CFTOKEN" will be removed, resulting in http://portal.acm.org/citation.cfm?id=1871437.1871580
- Then, a parameter "preflayout" will be set to "flat", resulting in http://portal.acm.org/citation.cfm?id=1871437.1871580&preflayout=flat
- The numeric ID matches the regular expression, so it is replaced, resulting in http://portal.acm.org/citation.cfm?id=1871580&preflayout=flat
- Finally, the system acknowledges that a different host, dl.acm.org, may be used to address the same information resource. That is essentially to say: if you see http://dl.acm.org/citation.cfm?id=1871580&preflayout=flat later, it is the same thing.
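The walkthrough above can be reproduced with a small Python sketch. It is an illustration only, not how BigSemantics implements location filters. The function names and argument layout are hypothetical, and the replacement string uses Python's \1 syntax where the filter above uses $1.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def filter_location(url, strip_params=(), set_params=None, regexes=()):
    """Apply strip_param, set_param, and regex actions, in that order."""
    parts = urlsplit(url)
    # strip_param: drop unwanted query parameters by name.
    pairs = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in strip_params]
    # set_param: add the parameter, replacing any existing value.
    for name, value in (set_params or {}).items():
        pairs = [(k, v) for k, v in pairs if k != name] + [(name, value)]
    result = urlunsplit((parts.scheme, parts.netloc, parts.path,
                         urlencode(pairs), parts.fragment))
    # regex: match and replace arbitrary strings in the URL.
    for pattern, replacement in regexes:
        result = re.sub(pattern, replacement, result)
    return result

def same_resource(url_a, url_b, alternative_hosts):
    """alternative_host: URLs that differ only in an interchangeable host are equal."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    hosts_ok = a.netloc == b.netloc or {a.netloc, b.netloc} <= set(alternative_hosts)
    return hosts_ok and (a.scheme, a.path, a.query) == (b.scheme, b.path, b.query)

url = ("http://portal.acm.org/citation.cfm?id=1871437.1871580"
       "&coll=DL&dl=ACM&CFID=178910658&CFTOKEN=90231150")
filtered = filter_location(
    url,
    strip_params={"coll", "dl", "CFID", "CFTOKEN"},
    set_params={"preflayout": "flat"},
    regexes=[(r"id=\d+\.(\d+)", r"id=\1")],
)
print(filtered)  # http://portal.acm.org/citation.cfm?id=1871580&preflayout=flat
```

Calling same_resource with the filtered URL, the dl.acm.org variant of it, and the two hosts listed in the filter returns True in this sketch, which corresponds to the last step of the walkthrough.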