
Tutorial on Data Extraction

keithkade edited this page Dec 16, 2014 · 2 revisions


Overview

This tutorial will show you how to create a wrapper specific to UrbanSpoon, complete with the necessary information extraction rules.

Attaching Extraction Rules

We will create a new wrapper named urban_spoon_restaurant that inherits from the "restaurant" type we created in the last tutorial.

The key idea here is that we can have one base wrapper (restaurant) that defines the generic data structure for all kinds of restaurants, and then derive different wrappers from it for different websites. We need different extraction rules for each one, since each website uses its own visual styles and page layout.

More abstractly, this strategy separates data structure definitions from site-specific extraction rules. This separation of concerns promotes reusability and maintainability in practice.

The new wrapper can be placed in the same file after wrapper restaurant.

The new wrapper will contain all the fields we defined in restaurant, as well as fields inherited from base types such as compound_document, since we want to apply extraction rules to them. It will look like this, with blank XPath expressions for now:

<meta_metadata name="urban_spoon_restaurant" type="restaurant" parser="xpath"
  comment="UrbanSpoon restaurant description page">
  <selector url_path_tree="http://www.urbanspoon.com/r/" />
  <example_url url="http://www.urbanspoon.com/r/114/875031/restaurant/College-Station/Christophers-World-Grill-Bryan" />
  
  <scalar name="title">
    <xpath></xpath>
  </scalar>
  <scalar name="phone">
    <xpath></xpath>
  </scalar>
  <scalar name="pic">
    <xpath></xpath>
  </scalar>
  <scalar name="rating">
    <xpath></xpath>
  </scalar>
  <scalar name="price_range">
    <xpath></xpath>
  </scalar>
  <scalar name="map">
    <xpath></xpath>
  </scalar>
  <collection name="genres">
    <xpath></xpath>
    <scalar name="title">
      <xpath></xpath>
    </scalar>
    <scalar name="location">
      <xpath></xpath>
    </scalar>
  </collection>
</meta_metadata>

The following key changes were made from the previous tutorial:

  • Element selector is added. We will address it later in this tutorial, after explaining extraction rules.
  • Element example_url provides a way to attach an exemplar input URL to this wrapper. Multiple example_url elements are allowed. It is good practice to attach exemplar URLs to authored wrappers, since they help with testing and maintenance.
  • Attribute type is used instead of extends, which means using the data structure defined by that wrapper (restaurant in the example) directly and exclusively (in terms of its attributes and fields), without defining any new fields. In other words, this wrapper does not define a genuinely new type; it only reuses an existing one.
  • Attribute parser="xpath" means this wrapper will use the XPath parser to extract metadata from the raw HTML. When the data is presented as XML, such as RSS or a web service API, direct binding (parser="direct") can be used to map the XML tree directly to metadata.
  • XPath expressions are given in xpath elements. You may add multiple xpath elements if, for example, the name of a restaurant may appear in one of two places on a web page and you cannot guarantee which one it will be in.

Note that when a wrapper reuses another one directly through type, it cannot define new fields, or use nested meta-metadata or generic fields. In those cases you will need to use extends to create a new type. We will cover these two advanced topics later.

Also note that there are now two fields in genres. Remember that genres was defined with child_type="compound_document". These are the fields from the compound_document type that we will use, title (String) and location (ParsedURL), to store the name of the food genre and a link to the respective UrbanSpoon search page. When including collections of web pages, regardless of the source page's type, using compound_document with the title and location fields allows the user to expand those web pages using MICE.

XPath Expressions

If you are not familiar with XPath expressions please visit this tutorial.

We will look closely at the XPath for three distinct elements: title, pic, and genres.

Using the Chrome web inspector, we can see the restaurant's title in the HTML:

<div class='hreview-aggregate' id='directory'>
  <div class='item'>
  <h1 class='page_title fn org'>Christopher's World Grill</h1>
  ...

Therefore a correct XPath expression for extracting the restaurant title would be: <xpath>//div[@id='directory']/div/h1</xpath>
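As a quick sanity check, candidate XPath expressions can be evaluated outside of BigSemantics. The sketch below uses only Python's standard library on a well-formed, abridged copy of the fragment above; real pages are rarely valid XML, so in practice an HTML-aware parser such as lxml is a better fit.

```python
import xml.etree.ElementTree as ET

# Abridged, well-formed stand-in for the UrbanSpoon page fragment above.
doc = """
<html><body>
  <div class='hreview-aggregate' id='directory'>
    <div class='item'>
      <h1 class='page_title fn org'>Christopher's World Grill</h1>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(doc)
# ElementTree supports the XPath subset used here; '//' becomes './/'.
titles = root.findall(".//div[@id='directory']/div/h1")
print(titles[0].text)  # Christopher's World Grill
```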

Strategy 1: Use class and id attributes to precisely specify the piece of information you want.
Many websites use meaningful class and id attributes to lay out their pages. Proper use of them makes your XPath expressions more accurate and succinct, and more likely to survive visual style changes made by the website.
Strategy 2: If you want to extract text from a node that contains formatting, i.e., some of the text may be strong, use the XPath "../parentNode" instead of "../parentNode/text()".
Now we will look at extracting an image source URL.
<a href="/rph/114/875031/27276/college-station-christopher-s-world-grill-christopher-s-world-grill-photo">
  <img alt="Christopher's World Grill" height="130" src='/images/1/blank.gif' srcd="http://a1.urbancdn.com/w/s/ht/n1Pc0RahTuNJQK-130.jpg" width="130" />
</a>

There are many images on the page and none of the <div>'s class or id attributes are very helpful. Instead of forming a path like the previous one, I used XPather to help deduce the correct expression: <xpath>//div[@id='aside']/div[@class='list photos']/ul[1]/li[1]/div[@class='photo']/div[@class='image']/a/img/@srcd</xpath> Note that the srcd attribute is what we really need.

The last XPath we will explore is for genres. Because the genres field is a collection, the XPath will need to select a list of nodes. Looking through the HTML and using XPather, I have formulated a correct expression: <xpath>//div[@id='secondary']/div[@class='cuisines']/fieldset/a</xpath>

It returns three nodes, one for each of the listed genres.

For each genre we want the name as a String and the link to its respective search page as a ParsedURL. We can use relative XPaths for fields nested in a composite or collection field, though global XPaths are still allowed. Here these XPaths will be: <xpath>./text()</xpath> and <xpath>./@href</xpath>
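The same approach works for collection XPaths and their nested relative expressions. In this standard-library sketch (using a hypothetical, abridged stand-in for the genres markup), each matched a node yields its text for title and its href attribute for location:

```python
import xml.etree.ElementTree as ET

# Hypothetical, abridged stand-in for the genres block of the page.
doc = """
<div id='secondary'>
  <div class='cuisines'>
    <fieldset>
      <a href='/c/114/American.html'>American</a>
      <a href='/c/114/Steakhouse.html'>Steakhouse</a>
    </fieldset>
  </div>
</div>
"""

root = ET.fromstring(doc)
# The collection XPath selects the list of genre nodes...
genres = root.findall(".//div[@class='cuisines']/fieldset/a")
# ...and the nested relative XPaths are evaluated on each node:
# ./text() -> node text, ./@href -> attribute lookup.
pairs = [(a.text, a.get("href")) for a in genres]
print(pairs)
```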

The XPath expressions for title and phone begin with the same sequence (that is, the two nodes are inside a common parent node). We can define a variable to store the common component of the XPath expressions to reduce repetition and assist maintenance. We use the def_var tag to define a variable of type node with the common XPath component.

<def_var name="main_block" type="node">
 <xpath>//div[@id='center']/div[1]</xpath>
</def_var>

We can then reference this variable in the context_node attribute for title and phone, and shorten the XPath expressions for these fields to contain only the unique ending:

<scalar name="title" context_node="main_block">
  <xpath>./h1</xpath>
</scalar>
<scalar name="phone" context_node="main_block">
  <xpath>./h3</xpath>
</scalar>

Here is the full wrapper:

<meta_metadata name="urban_spoon_restaurant" type="restaurant" parser="xpath"
  comment="UrbanSpoon restaurant description page">
  <selector url_path_tree="http://www.urbanspoon.com/r/" />
  <example_url url="http://www.urbanspoon.com/r/114/875031/restaurant/College-Station/Christophers-World-Grill-Bryan" />
   <def_var name="main_block" type="node">
     <xpath>//div[@id='center']/div[1]</xpath>
   </def_var>
  
  <scalar name="title" context_node="main_block">
    <xpath>./h1</xpath>
  </scalar>
  <scalar name="phone" context_node="main_block">
    <xpath>./h3</xpath>
  </scalar>
  <scalar name="pic">
    <xpath>//div[@id='aside']/div[@class='list photos']/ul[1]/li[1]/div[@class='photo']/div[@class='image']/a/img/@srcd</xpath>
  </scalar>
  <scalar name="rating">
    <xpath>//div[@id='vote_block']/div[@class='score up']/div[@class='number']/span[@class='digits percent-text rating average']</xpath>
  </scalar>
  <scalar name="price_range">
    <xpath>//div[@id='secondary']/div[@class='menu']/fieldset/div[@class='price']/span[@class='pricerange']</xpath>
  </scalar>
  <scalar name="map" context_node="main_block">
    <xpath>./div[@class='address adr']/a[1]/@href</xpath>
  </scalar>
  <collection name="genres">
    <xpath>//div[@id='secondary']/div[@class='cuisines']/fieldset/a</xpath>
    <scalar name="title">
      <xpath>./text()</xpath>
    </scalar>
    <scalar name="location">
      <xpath>./@href</xpath>
    </scalar>
  </collection>
</meta_metadata>

Note that XPaths may need to change when websites update their layout.

Using Selectors

There is still an unresolved problem: with many wrappers in the repository, how does BigSemantics know which one to use for a specific information resource?

The solution is a dedicated structure called selectors that helps BigSemantics automatically pick the appropriate wrapper for an input URL. The factors BigSemantics considers in making this choice include URL patterns, MIME types, and suffixes.

One wrapper can have multiple selectors, specifying multiple situations in which the wrapper should be used. If the input URL matches any of a wrapper's selectors, that wrapper will be used.

Note: a raw ampersand in a URL will cause an XML parsing error. Replace any '&' with the entity "&_amp;" (without the '_').
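For instance, a hypothetical example_url whose query part contains two parameters would be written with the entity:

```xml
<!-- Hypothetical URL; the raw form "?page=1&sort=rating" must be escaped. -->
<example_url url="http://www.example.com/r/list?page=1&amp;sort=rating" />
```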

Using URL Patterns

The most widely used selectors specify the URL pattern for information resources that a wrapper should handle. There are 3 types of patterns:

url_stripped
This matches against the part of the URL without queries (the part before the question mark).
url_path_tree
This matches against the beginning part of the hierarchy in a URL. This has been used in the example with UrbanSpoon.
url_regex and url_regex_fragment
These allow you to write a regular expression to match against the input URL. The former must match the whole input URL, while the latter can match a fragment of the input URL. Expressions must be compliant with the JavaScript flavor of regex, so features like lookbehinds will not work. When these are used, you must specify the domain attribute on the selector to limit matching to a specific domain, for performance reasons.
When url_stripped or url_regex is used, the selector can use the inner element param to further match URL query parameters. For example:
<selector url_stripped="http://www.nsf.gov/awardsearch/advancedSearchResult">
  <param name="PILastName" />
</selector>
matches URLs with the specified trunk and with a parameter named "PILastName" in its query part.
<selector url_regex="https?://www.google.com/search\?.*" domain="google.com">
  <param name="tbm" value="isch" />
</selector>
matches URLs with the specified regular expression and with a parameter "tbm", whose value is "isch", in its query part.
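The param-matching step can be pictured with a small sketch using Python's standard library. Here params_match is a hypothetical helper, not part of the BigSemantics API; a value of None means "the parameter merely has to be present", mirroring the two selectors above.

```python
from urllib.parse import urlsplit, parse_qs

def params_match(url, required):
    """Check a URL's query part against a selector's <param> elements.
    `required` maps name -> expected value, or None for "just present"."""
    query = parse_qs(urlsplit(url).query)
    for name, value in required.items():
        if name not in query:
            return False
        if value is not None and value not in query[name]:
            return False
    return True

# The Google image-search selector: tbm must be present with value "isch".
print(params_match("https://www.google.com/search?q=cats&tbm=isch",
                   {"tbm": "isch"}))  # True
# The NSF selector: PILastName only has to be present.
print(params_match("http://www.nsf.gov/awardsearch/advancedSearchResult?PILastName=Doe",
                   {"PILastName": None}))  # True
```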

Using MIME Type or Suffix

Selectors can also select by MIME types or suffixes. For example, the following selector matches common image formats:

<selector>
  <mime_type>image/jpeg</mime_type>
  <mime_type>image/png</mime_type>
  <mime_type>image/gif</mime_type>
  <mime_type>image/bmp</mime_type>
  <suffix>jpg</suffix>
  <suffix>jpeg</suffix>
  <suffix>gif</suffix>
  <suffix>png</suffix>
  <suffix>bmp</suffix>
</selector>

Selector priority

Different kinds of selectors have different priority when used together.

URL patterns will be considered first. Out of URL patterns, url_stripped has the highest priority. If matching by url_stripped failed, url_path_tree will be considered next. If matching failed again, url_regex will be used, then url_regex_fragment. Parameters will only be considered when the enclosing URL pattern selector is activated.

If URL patterns do not match any particular wrapper, suffix will be considered.

If neither URL patterns or suffixes match any particular wrapper, the default wrapper compound_document will be used.

When BigSemantics connects to the page, if the server provides a MIME type, it will try to use the MIME type to match a wrapper if the previous matching attempts failed.
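The priority cascade above can be sketched as a simple ordered scan. This is a hypothetical illustration of the documented order, not the actual BigSemantics implementation; mime_type appears last because it only becomes available after connecting to the page.

```python
# Selector kinds in the documented priority order.
PRIORITY = ["url_stripped", "url_path_tree", "url_regex",
            "url_regex_fragment", "suffix", "mime_type"]

def pick_wrapper(matches):
    """matches: (selector_kind, wrapper_name) pairs whose selectors matched.
    Returns the wrapper with the highest-priority selector kind, falling
    back to the default compound_document wrapper when nothing matched."""
    for kind in PRIORITY:
        for selector_kind, wrapper in matches:
            if selector_kind == kind:
                return wrapper
    return "compound_document"

# A url_path_tree match beats a suffix match:
print(pick_wrapper([("suffix", "image"),
                    ("url_path_tree", "urban_spoon_restaurant")]))
# No matches at all fall back to the default:
print(pick_wrapper([]))  # compound_document
```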

URL filtering

Some websites attach extra URL query parameters to their links to facilitate session management and tracking. This usually makes it hard to compare URLs, because two URLs with different session IDs or tracking information may actually point to the same information resource.

BigSemantics allows you to use location filters in a wrapper to filter out such query parameters. This happens after an input URL is matched with the wrapper, but before any other processing.

Location filters support the following actions:

  • set_param: Adds or sets a query parameter with a specific name and, optionally, a specific value.
  • strip_param: Removes unwanted query parameters by name.
  • regex: Matches and replaces arbitrary strings in the URL. This is the most flexible option.
  • alternative_host: Specifies that the same information resource can be addressed through alternative hosts.

For example, here is the location filter for the ACM Digital Library:
<filter_location>
  <strip_param name="coll" />
  <strip_param name="dl" />
  <strip_param name="CFID" />
  <strip_param name="CFTOKEN" />
  <set_param name="preflayout" value="flat" />
  <regex match="id=\d+\.(\d+)" replace="id=$1" />
  <alternative_host>portal.acm.org</alternative_host>
  <alternative_host>dl.acm.org</alternative_host>
</filter_location>

Consider an input URL http://portal.acm.org/citation.cfm?id=1871437.1871580&coll=DL&dl=ACM&CFID=178910658&CFTOKEN=90231150 that is matched with this wrapper. The filter strips the coll, dl, CFID, and CFTOKEN parameters, sets preflayout=flat, and rewrites the id, yielding http://portal.acm.org/citation.cfm?id=1871580&preflayout=flat.

Note that the order in which these operations are applied is not guaranteed, so they must be independent of each other.
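To make the effect concrete, here is a hypothetical re-implementation of this particular filter using Python's standard library (the real filtering happens inside BigSemantics; alternative_host handling is omitted):

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def filter_location(url):
    parts = urlsplit(url)
    # strip_param: drop the session/tracking parameters.
    params = [(k, v) for k, v in parse_qsl(parts.query)
              if k not in {"coll", "dl", "CFID", "CFTOKEN"}]
    # set_param: force preflayout=flat.
    params = [(k, v) for k, v in params if k != "preflayout"]
    params.append(("preflayout", "flat"))
    url = urlunsplit(parts._replace(query=urlencode(params)))
    # regex: keep only the second component of a dotted id.
    return re.sub(r"id=\d+\.(\d+)", r"id=\1", url)

print(filter_location(
    "http://portal.acm.org/citation.cfm"
    "?id=1871437.1871580&coll=DL&dl=ACM&CFID=178910658&CFTOKEN=90231150"))
# http://portal.acm.org/citation.cfm?id=1871580&preflayout=flat
```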

Now you've learned how to create data definitions (types, schemas, data models, or whatever you would like to call them) and attach extraction rules for specific information sources using wrappers.

If you are a developer, you can now take a look at compiling wrappers and running MmTest to see how it works.

The next tutorial will explain how to use semantic actions to implement control flows on extracted metadata, and connect them to your own application.