Tutorial on Data Extraction
This tutorial will show you how to create a wrapper specific to UrbanSpoon, complete with the necessary information extraction rules.
We will create a new wrapper named urban_spoon_restaurant that inherits from the restaurant type we created in the last tutorial.
The key idea here is that we can have one base wrapper (restaurant) that defines the generic data structure for all kinds of restaurants, and then derive different wrappers from that base wrapper for different websites. We need different extraction rules for each of them, since each website uses its own visual styles and page layout.
More abstractly, this strategy separates data structure definitions from site-specific extraction rules. This separation of concerns promotes reusability and maintainability in practice.
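For reference, the base wrapper from the previous tutorial looks roughly like the sketch below. The exact scalar types are assumptions here, inferred from how the fields are used later in this tutorial; title comes from compound_document and does not need to be redeclared.
<!-- Sketch only: scalar types below are assumptions, not taken from the previous tutorial -->
<meta_metadata name="restaurant" extends="compound_document"
    comment="Generic restaurant description">
  <scalar name="phone" scalar_type="String" />
  <scalar name="pic" scalar_type="ParsedURL" />
  <scalar name="rating" scalar_type="String" />
  <scalar name="price_range" scalar_type="String" />
  <scalar name="map" scalar_type="ParsedURL" />
  <collection name="genres" child_type="compound_document" />
</meta_metadata>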
The new wrapper can be placed in the same file, after wrapper restaurant.
The new wrapper will contain all the fields we defined in restaurant, as well as fields inherited from base types such as compound_document, since we want to apply extraction rules to them. It will look like this, with blank XPath expressions for now:
<meta_metadata name="urban_spoon_restaurant" type="restaurant" parser="xpath"
comment="UrbanSpoon restaurant description page">
<selector url_path_tree="http://www.urbanspoon.com/r/" />
<example_url url="http://www.urbanspoon.com/r/114/875031/restaurant/College-Station/Christophers-World-Grill-Bryan" />
<scalar name="title">
<xpath></xpath>
</scalar>
<scalar name="phone">
<xpath></xpath>
</scalar>
<scalar name="pic">
<xpath></xpath>
</scalar>
<scalar name="rating">
<xpath></xpath>
</scalar>
<scalar name="price_range">
<xpath></xpath>
</scalar>
<scalar name="map">
<xpath></xpath>
</scalar>
<collection name="genres">
<xpath></xpath>
<scalar name="title">
<xpath></xpath>
</scalar>
<scalar name="location">
<xpath></xpath>
</scalar>
</collection>
</meta_metadata>
The following are the key changes from the previous tutorial:
- The selector element is added. We will return to it later in this tutorial, after explaining extraction rules.
- The example_url element provides a way to attach an exemplar input URL to this wrapper. Multiple example_url elements are allowed. It is good practice to attach exemplar URLs to authored wrappers, since they help with testing and maintenance.
- The type attribute is used instead of extends. This means the wrapper directly uses the data structure defined by the named wrapper (restaurant in this example), in terms of its attributes and fields, without defining any new fields. In other words, this wrapper does not define a genuinely new type; it only reuses an existing one.
- The attribute parser="xpath" means this wrapper will use the XPath parser to extract metadata from raw HTML. When the data is already presented as XML, such as an RSS feed or a web service API, direct binding (parser="direct") can be used to map the XML tree directly to metadata.
- XPath expressions are given in xpath child elements. You may add multiple xpath elements to a field if, for example, the name of a restaurant may be in one of two places on a web page and you cannot guarantee which one it will be in (see the sketch below).
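For instance, if the restaurant name could show up in either of two places, the title scalar could list both expressions. The paths below are illustrative: the first one is derived later in this tutorial, and the second is purely hypothetical.
<scalar name="title">
  <!-- primary location of the title (derived later in this tutorial) -->
  <xpath>//div[@id='directory']/div/h1</xpath>
  <!-- hypothetical fallback location -->
  <xpath>//div[@id='content']/h1</xpath>
</scalar>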
Also note that there are now two fields in genres. Recall that genres was defined with child_type="compound_document". title (a String) and location (a ParsedURL) are the fields from the compound_document type that we will use, to store the name of the food genre and a link to the corresponding UrbanSpoon search page. When including collections of web pages, regardless of the source page's type, using compound_document with the title and location fields allows the user to expand those web pages using MICE.
If you are not familiar with XPath expressions, please visit this tutorial.
We will look closely at the XPath for three distinct elements: title, pic, and genres.
Using the Chrome web inspector, we can see the restaurant's title in the HTML:
<div class='hreview-aggregate' id='directory'>
<div class='item'>
<h1 class='page_title fn org'>Christopher's World Grill</h1>
...
Therefore a correct XPath expression for extracting the restaurant title would be: <xpath>//div[@id='directory']/div/h1</xpath>
- Strategy 1: Use the class and id attributes to precisely specify the piece of information you want.
- Many websites use meaningful class and id attributes to lay out their pages. Proper use of these attributes makes your XPath expressions more accurate and succinct, and more likely to survive visual style changes made by the website.
- Strategy 2: If you want to extract text from a node that contains formatting, e.g. some of the text may be wrapped in strong tags, use the XPath "../parentNode" instead of "../parentNode/text()", as illustrated below.
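To illustrate strategy 2, suppose the title were marked up with inline formatting (hypothetical markup):
<h1 class='page_title'>Christopher's <strong>World Grill</strong></h1>
<!-- /text() selects only the direct text node, yielding just "Christopher's " -->
<xpath>//h1[@class='page_title']/text()</xpath>
<!-- selecting the h1 node itself yields its full text content, "Christopher's World Grill" -->
<xpath>//h1[@class='page_title']</xpath>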
<a href="/rph/114/875031/27276/college-station-christopher-s-world-grill-christopher-s-world-grill-photo">
<img alt="Christopher's World Grill" height="130" src='/images/1/blank.gif' srcd="http://a1.urbancdn.com/w/s/ht/n1Pc0RahTuNJQK-130.jpg" width="130" />
</a>
There are many images on the page, and none of the enclosing div elements' class or id attributes are very helpful. Instead of forming a path like the previous one, I used XPather to help deduce the correct expression: <xpath>//div[@id='aside']/div[@class='list photos']/ul[1]/li[1]/div[@class='photo']/div[@class='image']/a/img/@srcd</xpath> Note that the srcd attribute is what we really need.
The last XPath we will explore is for genres. Because the genres field is a collection, the XPath needs to select a list of nodes. Looking through the HTML and using XPather, I formulated a correct expression: <xpath>//div[@id='secondary']/div[@class='cuisines']/fieldset/a</xpath>
It returns three nodes, one for each of the listed genres.
For each genre we want the name as a String and the link to its respective search page as a ParsedURL. We can use relative XPaths for fields nested in a composite or collection field, though global XPaths are still allowed. Here these XPaths will be: <xpath> ./text() </xpath> and <xpath> ./@href </xpath>
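Putting these together, the genres collection looks like this (this fragment also appears in the full wrapper below):
<collection name="genres">
  <xpath>//div[@id='secondary']/div[@class='cuisines']/fieldset/a</xpath>
  <scalar name="title">
    <xpath>./text()</xpath>
  </scalar>
  <scalar name="location">
    <xpath>./@href</xpath>
  </scalar>
</collection>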
The XPath expressions for title and phone begin with the same sequence (that is, the two nodes are inside a common parent node). We can define a variable to store the common component of the XPath expressions to reduce repetition and assist maintenance. We use the def_var tag to define a variable of type node with the common XPath component.
<def_var name="main_block" type="node">
<xpath>//div[@id='center']/div[1]</xpath>
</def_var>
We can then reference this variable in the context_node attribute for title and phone, and shorten the XPath expressions for these fields to contain only the unique ending:
<scalar name="title" context_node="main_block" />
<xpath>./h1</xpath>
</scalar>
<scalar name="phone" context_node="main_block">
<xpath>./h3</xpath>
</scalar>
Here is the full wrapper:
<meta_metadata name="urban_spoon_restaurant" type="restaurant" parser="xpath"
comment="UrbanSpoon restaurant description page">
<selector url_path_tree="http://www.urbanspoon.com/r/" />
<example_url url="http://www.urbanspoon.com/r/114/875031/restaurant/College-Station/Christophers-World-Grill-Bryan" />
<def_var name="main_block" type="node">
<xpath>//div[@id='center']/div[1]</xpath>
</def_var>
<scalar name="title" context_node="main_block">
<xpath>./h1</xpath>
</scalar>
<scalar name="phone" context_node="main_block">
<xpath>./h3</xpath>
</scalar>
<scalar name="pic">
<xpath>//div[@id='aside']/div[@class='list photos']/ul[1]/li[1]/div[@class='photo']/div[@class='image']/a/img/@srcd</xpath>
</scalar>
<scalar name="rating">
<xpath>//div[@id='vote_block']/div[@class='score up']/div[@class='number']/span[@class='digits percent-text rating average']</xpath>
</scalar>
<scalar name="price_range">
<xpath>//div[@id='secondary']/div[@class='menu']/fieldset/div[@class='price']/span[@class='pricerange']</xpath>
</scalar>
<scalar name="map" context_node="main_block>
<xpath>./div[@class='address adr']/a[1]/@href</xpath>
</scalar>
<collection name="genres">
<xpath>//div[@id='secondary']/div[@class='cuisines']/fieldset/a</xpath>
<scalar name="title">
<xpath>./text()</xpath>
</scalar>
<scalar name="location">
<xpath>./@href</xpath>
</scalar>
</collection>
</meta_metadata>
Note that XPaths may need to change over time, because websites may update their layouts.
There is still an unresolved problem: with many wrappers in the repository, how does BigSemantics know which one to use for a given information resource?
The solution is a dedicated structure called selectors, which helps BigSemantics automatically pick the appropriate wrapper for an input URL. The factors BigSemantics considers when making this choice include URL patterns, MIME types, and suffixes.
A wrapper can have multiple selectors, to specify multiple situations in which it should be used. If the input URL matches any of the selectors for a wrapper, that wrapper will be used.
Note: ampersands in URLs will cause an XML parsing error. Replace any '&' with "&_amp;" (without the '_'), as in the example below.
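For example, a hypothetical example_url whose query string contains an ampersand would be written in the wrapper XML as:
<!-- hypothetical URL; note the escaped ampersand -->
<example_url url="http://www.example.com/restaurants?city=Austin&amp;cuisine=pizza" />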
The most widely used selectors specify the URL patterns for information resources that a wrapper should handle. There are three kinds of patterns:
- url_stripped
- This matches against the part of the URL without the query string (the part before the question mark).
- url_path_tree
- This matches against the beginning part of the hierarchy in a URL. This has been used in the example with UrbanSpoon.
- url_regex and url_regex_fragment
- These allow you to write a regular expression to match against the input URL. The former must match the whole input URL, while the latter can match a fragment of it. Expressions must be compliant with the JavaScript flavor of regex, so constructs like lookbehinds will not work. When these are used, you must also specify the domain attribute on the selector to limit matching to a specific domain, for performance reasons.
A URL pattern selector can additionally require specific query parameters (optionally with specific values) through param elements, as in these examples:
<selector url_stripped="http://www.nsf.gov/awardsearch/advancedSearchResult">
<param name="PILastName" />
</selector>
<selector url_regex="https?://www.google.com/search\?.*" domain="google.com">
<param name="tbm" value="isch" />
</selector>
Selectors can also select by MIME types or suffixes. For example, the following selector matches common image formats:
<selector>
<mime_type>image/jpeg</mime_type>
<mime_type>image/png</mime_type>
<mime_type>image/gif</mime_type>
<mime_type>image/bmp</mime_type>
<suffix>jpg</suffix>
<suffix>jpeg</suffix>
<suffix>gif</suffix>
<suffix>png</suffix>
<suffix>bmp</suffix>
</selector>
Different kinds of selectors have different priority when used together.
URL patterns are considered first. Among URL patterns, url_stripped has the highest priority. If matching by url_stripped fails, url_path_tree is considered next. If matching fails again, url_regex is used, then url_regex_fragment. Parameters are only considered when the enclosing URL pattern selector is activated.
If URL patterns do not match any particular wrapper, suffixes are considered.
If neither URL patterns nor suffixes match any particular wrapper, the default wrapper compound_document is used.
When BigSemantics connects to the page and the server provides a MIME type, it will try to use the MIME type to match a wrapper if the previous matching attempts have failed.
Some websites attach extra query parameters to their links to facilitate session management and tracking. This usually makes it hard to compare URLs, because two URLs with different session IDs or tracking information may actually point to the same information resource.
BigSemantics allows you to use location filters on a wrapper to strip out such query parameters. Filtering happens after an input URL is matched with the wrapper, but before any other processing.
Location filters support the following actions:
- set_param: Adds or sets a query parameter with a given name and, optionally, a specific value.
- strip_param: Removes unwanted query parameters by name.
- regex: Matches and replaces arbitrary strings in the URL. This is very flexible.
- alternative_host: Specifies that the same information resource can be addressed through alternative hosts.
<filter_location>
<strip_param name="coll" />
<strip_param name="dl" />
<strip_param name="CFID" />
<strip_param name="CFTOKEN" />
<set_param name="preflayout" value="flat" />
<regex match="id=\d+\.(\d+)" replace="id=$1" />
<alternative_host>portal.acm.org</alternative_host>
<alternative_host>dl.acm.org</alternative_host>
</filter_location>
Consider an input URL http://portal.acm.org/citation.cfm?id=1871437.1871580&coll=DL&dl=ACM&CFID=178910658&CFTOKEN=90231150 that matches this wrapper and is now being filtered:
- First, parameters such as "coll", "dl", "CFID", and "CFTOKEN" will be removed, resulting in http://portal.acm.org/citation.cfm?id=1871437.1871580
- Then, a parameter "preflayout" will be set to "flat", resulting in http://portal.acm.org/citation.cfm?id=1871437.1871580&preflayout=flat
- The numeric ID matches the regular expression and is replaced, resulting in http://portal.acm.org/citation.cfm?id=1871580&preflayout=flat
- Finally, the system acknowledges that a different host, dl.acm.org, may be used to address the same information resource. That is essentially to say: if you see http://dl.acm.org/citation.cfm?id=1871580&preflayout=flat later, it refers to the same thing.
Now you've learned how to create data definitions (types, schemas, data models, or whatever you would like to call them) and attach extraction rules for specific information sources using wrappers.
If you are a developer, you can now take a look at compiling wrappers and running MmTest to see how it works.
The next tutorial will explain how to use semantic actions to implement control flows on extracted metadata, and connect them to your own application.