We should be able to use <example_url> for example URLs that should not be selected #11

Open
quyin opened this issue Mar 5, 2013 · 10 comments

Member

quyin commented Mar 5, 2013

This can be done by something like:

<example_url url="..." should_not_select="true" />

Also, the unit test that exercises and verifies all example_urls being selected by the corresponding selector should be updated, acknowledging this new attribute.

One use case is: the <imdb_movie> selector should not select this URL (it used to): http://www.rottentomatoes.com/m/django_unchained_2012/trailers/
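A minimal sketch of how such a unit test might treat the new attribute. The `<wrapper>` element, the `check_example_urls` helper, and the use of a full-string regex match are assumptions made for illustration, not the project's actual test code; the regex shown is a Rotten Tomatoes movie pattern used only as a stand-in selector.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical wrapper layout; only <selector> and <example_url> come from the discussion.
WRAPPER = """
<wrapper>
  <selector url_regex="http://www.rottentomatoes.com/m/[^/]*/" />
  <example_url url="http://www.rottentomatoes.com/m/inglourious_basterds/" />
  <example_url url="http://www.rottentomatoes.com/m/django_unchained_2012/trailers/"
               should_not_select="true" />
</wrapper>
"""

def check_example_urls(wrapper_xml):
    """Return (url, expected_selected, actually_selected) for every mismatch."""
    root = ET.fromstring(wrapper_xml)
    # Assumption: a selector "selects" a URL when its regex matches the whole URL.
    pattern = re.compile(root.find("selector").get("url_regex"))
    failures = []
    for example in root.iter("example_url"):
        url = example.get("url")
        # URLs without the attribute are expected to be selected.
        expected = example.get("should_not_select", "false") != "true"
        actual = pattern.fullmatch(url) is not None
        if expected != actual:
            failures.append((url, expected, actual))
    return failures

print(check_example_urls(WRAPPER))
```

An empty list means every example URL behaved as declared; each entry otherwise names a URL whose declared expectation disagrees with the selector.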

Member Author

quyin commented Mar 5, 2013

dont_select would be a better attribute name.


ghost commented Mar 5, 2013

http://www.rottentomatoes.com/m/django_unchained_2012/trailers/ gets selected by some other nested metadata, right?


ghost commented Mar 5, 2013

Also: Maybe we make example urls go in a structure instead of adding an attribute?
Mayhaps:

<selects>
    <example_url url=" .... " /> 
    <example_url url=" .... " /> 
</selects> 
<never_selects> 
    <example_url url="...." />
</never_selects> 

Member Author

quyin commented Mar 5, 2013

No, that Rotten Tomatoes trailers URL is not selected by any metadata, at
least for now. We don't have a good type for it, so it defaults to
<compound_document>.

Best Regards,
Yin Qu (屈垠)


Member Author

quyin commented Mar 5, 2013

The <selects> / <never_selects> structure looks more readable. The only
issue is that it will not be compatible with the current format, but I think
it might be worthwhile to convert the current format to this new one.

Best Regards,
Yin Qu (屈垠)



ghost commented Mar 5, 2013

Do example urls belong reasonably in the selector? :\
Are there other sorts of parameterized testing we could do with examples beyond selection testing?

Contributor

andru1d commented Mar 5, 2013

the selector is a good place to put them for authoring.

i think we already have at least one program for collecting those
example_urls. and we will have more.

it seems sensible to me as a structure.

andruid




ghost commented Mar 5, 2013

Was just thinking, the names <selects> and <never_selects> seem misleading if they're in the selector.
Currently:

<selector url_regex="http://www.rottentomatoes.com/m/[^/]*/" domain="rottentomatoes.com"/> 
<example_url url="http://www.rottentomatoes.com/m/inglourious_basterds/" />

Maybe we could prefer this?

<selector url_regex="http://www.rottentomatoes.com/m/[^/]*/" domain="rottentomatoes.com">
    <should_select>
        <example_url url="http://www.rottentomatoes.com/m/inglourious_basterds/" />
    </should_select> 
    <should_never_select> 
        <example_url url="...." />
    </should_never_select> 
</selector>

Since "should" conveys a heuristic, rather than tempting someone into thinking "If I have a bunch of <selects>, it'll just select the things I give it."
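A rough sketch of how a test could read the nested form proposed above. The helper names `expectations` and `run_cases`, the full-string matching rule, and the concrete URL placed under <should_never_select> (borrowed from the use case at the top of this issue) are assumptions for illustration only.

```python
import re
import xml.etree.ElementTree as ET

# The URL under <should_never_select> is an assumed example, not part of the proposal.
SELECTOR = """
<selector url_regex="http://www.rottentomatoes.com/m/[^/]*/" domain="rottentomatoes.com">
    <should_select>
        <example_url url="http://www.rottentomatoes.com/m/inglourious_basterds/" />
    </should_select>
    <should_never_select>
        <example_url url="http://www.rottentomatoes.com/m/django_unchained_2012/trailers/" />
    </should_never_select>
</selector>
"""

def expectations(selector_xml):
    """Parse one <selector> into its compiled regex and (url, expected) pairs."""
    selector = ET.fromstring(selector_xml)
    pattern = re.compile(selector.get("url_regex"))
    cases = []
    for group, expected in (("should_select", True), ("should_never_select", False)):
        for example in selector.findall(f"{group}/example_url"):
            cases.append((example.get("url"), expected))
    return pattern, cases

def run_cases(selector_xml):
    """Map each example URL to True when the selector behaves as declared."""
    pattern, cases = expectations(selector_xml)
    # Assumption: "selected" means the regex matches the entire URL.
    return {url: (pattern.fullmatch(url) is not None) == expected
            for url, expected in cases}

print(run_cases(SELECTOR))
```

Because the expectation comes from the enclosing element rather than an attribute, a test runner only needs the two group names to know which outcome to assert.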

Member Author

quyin commented Sep 17, 2014

This needs some discussion before proceeding.

  1. Which format should we use, an extra attribute on <example_url>, or new structs like <should_select> and <should_not_select>? The latter is clearer but not compatible with the current format.
  2. Do we want to include more info in <example_url> for testing, and if so, what should it be? For the first cut, we probably want to specify something like "field X must exist, and its value must contain Y". An alternative would be to cache previous extraction results and compare against them.
  3. We will need a program that exercises example URLs and predicates on the server during the daily build and tracks failures and changes.
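To make point 2 concrete, here is a hedged sketch of both options: a "field X must exist and contain Y" predicate, and a comparison against a cached earlier result. `FieldPredicate`, `failed_predicates`, `changed_fields`, and the sample metadata are all hypothetical names and values; nothing here reflects an existing API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FieldPredicate:
    """Hypothetical predicate: the field must exist and, optionally, contain a substring."""
    field: str
    contains: Optional[str] = None

    def check(self, metadata: dict) -> bool:
        value = metadata.get(self.field)
        if value is None:
            return False  # field X must exist
        return self.contains is None or self.contains in str(value)

def failed_predicates(metadata, predicates):
    """Return the predicates that the extracted metadata does not satisfy."""
    return [p for p in predicates if not p.check(metadata)]

def changed_fields(previous, current):
    """Alternative approach: list fields whose values differ from a cached result."""
    keys = set(previous) | set(current)
    return sorted(k for k in keys if previous.get(k) != current.get(k))

# Illustrative, made-up extraction result for one example URL.
extracted = {"title": "Inglourious Basterds"}
print(failed_predicates(extracted, [FieldPredicate("title", contains="Basterds"),
                                    FieldPredicate("director")]))
```

A daily build job could run either check per example URL and record the returned lists, which would give the failure and change tracking mentioned in point 3.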

A future step will be to automatically correct extraction rules using cached source page and metadata.

Member Author

quyin commented Sep 22, 2014

A new tag will be used for bad example urls.
