Improve nsub and nn dependencies analysis #106

Merged
merged 41 commits into master
Feb 11, 2015
Conversation

yhamoudi
Member

@yhamoudi yhamoudi commented Feb 6, 2015

Many important modifications (in the end, the algorithm has been improved on more than just nsubj relations :)

nsubj relations and instance_of

I have observed that there were 2 types of nsubj relations:

  • the subject is "directly" linked to the question word. This case occurs when the question word appears in a subtree linked to the root by an nsubj(pass) relation. In these cases, we must produce instance_of triples. Examples:
    • What language is spoken in Argentina > (?, instance of, language)∩(Argentina, language, ?)
    • List movies directed by Spielberg > (?, instance of, movie)∩(?, director, Spielberg)
    • which book was authored by Victor Hugo > (?, instance of, book)∩(?, writer, Victor Hugo)
    • Which president has been killed by Oswald > (?, instance of, president)∩(?, killer, Oswald)
  • the other case: the one we were already supporting. Examples:
    • What is the capital of India
    • When was Benjamin Disraeli prime minister
    • Who is the author of Sea and Sky
    • Is there a ghost in my house

In order to deal with the first type of nsubj(pass) relation, I needed to add 2 new grammatical dependencies: nsubj_qw and nsubjpass_qw. These rules replace nsubj and nsubjpass when relevant (see updateNsubjRule in questionWordProcessing.py). I also added a rule R6 to handle instance_of triples.
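
For illustration, here is a minimal sketch of the pair of triples the new R6 rule is meant to produce for an nsubj_qw question such as "What language is spoken in Argentina". The Triple type and the function name are made up for the example; the real rules live in questionWordProcessing.py.

    # Toy illustration only, not the project's actual API.
    from collections import namedtuple

    Triple = namedtuple('Triple', ['subject', 'predicate', 'object'])
    MISSING = '?'

    def r6_instance_of(question_word_noun, entity, predicate):
        # Intersect (?, instance of, noun) with (entity, predicate, ?).
        return [Triple(MISSING, 'instance of', question_word_noun),
                Triple(entity, predicate, MISSING)]

    # "What language is spoken in Argentina"
    # -> (?, instance of, language) ∩ (Argentina, language, ?)
    print(r6_instance_of('language', 'Argentina', 'language'))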

nn/amod rules and merging

I've added a heuristic to improve multiword expression recognition. Previously, we were merging all nn relations. However, this is not relevant in some cases:

  • Who is the United States president > United States and president are merged

Now the merging is not performed between a -nn-> b if a and b have different namedEntityTag values (for instance Location and undef).

We need to perform more tests to be sure it's relevant. I've already found a problem: the named entity tagging is sometimes bad. Ex:

  • Where was Ulysses S. Grant born? > S. is not tagged PERSON
  • What actor married John F. Kennedy's sister? > F. is not tagged PERSON

We should at least correct these tags: if a word v is between 2 words u and w that have the same tag, and v is linked to u or w by an nn relation, then give v the tag of u and w.
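
For concreteness, here is a rough sketch of both heuristics (the merge condition and the tag fix). The data structures (words as (token, tag) pairs, nn as a set of index pairs) are illustrative and not the parser's real ones.

    def should_merge(tag_a, tag_b):
        # Merge a -nn-> b into a multiword expression only when both ends
        # carry the same named-entity tag (e.g. not LOCATION vs undef).
        return tag_a == tag_b

    def fix_ne_tags(words, nn):
        # If v sits between u and w, u and w share a tag, and v is linked to
        # one of them by an nn relation, give v that tag too.
        for i in range(1, len(words) - 1):
            (u, tu), (v, tv), (w, tw) = words[i - 1], words[i], words[i + 1]
            linked = {(i, i - 1), (i - 1, i), (i, i + 1), (i + 1, i)} & nn
            if tu == tw and tv != tu and linked:
                words[i] = (v, tu)
        return words

    words = [('Ulysses', 'PERSON'), ('S.', 'undef'), ('Grant', 'PERSON')]
    print(fix_ne_tags(words, {(2, 1)}))   # 'S.' gets tagged PERSON as well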

TODO before accepting the pull request

I propose to accept the pull request once these tasks are done (properly):

  • correct the MWE bug previously described (S. and F.). See Develop some heuristics to correct the dependency tree #107
  • have 50 deep tests in total (mainly linked to the previous modifications)
  • more (unit)tests on the previous improvements
  • fix the analysis of nn dependencies
  • fix the analysis of prep dependencies
  • fix the analysis of nsubj dependencies

@yhamoudi
Member Author

yhamoudi commented Feb 6, 2015

This problem is not directly linked to this pull request

Hmm, perhaps that was wrong. In the end, there will also be big changes to many other rules in this pull request.

@Tpt
Member

Tpt commented Feb 7, 2015

For questions like "directed by" or "written by" or "killed by", we should maybe have triples in the form (?, p, o), like (?, director, Spielberg) or (?, writer, Victor Hugo), because usually we state that (Les Misérables, author, Victor Hugo), as in "The author of Les Misérables is Victor Hugo".

Thanks a lot for these additions!

@Ezibenroc
Member

@Tpt: +1. But @yhamoudi said "some triples are not in the right order".

@yhamoudi
Member Author

yhamoudi commented Feb 7, 2015

Yes, all these things will be fixed in this pull request (today, I hope).

@yhamoudi
Member Author

yhamoudi commented Feb 7, 2015

Done! I will write down some details afterwards, but you can test and report any problems right now (even reversed triples).

@Ezibenroc
Member

"What language is spoken in Argentina?" is not handled correctly (it was before 1c93ec6 ).

@Ezibenroc
Member

Barack Obama is not an instance of president: https://www.wikidata.org/wiki/Q76

@yhamoudi
Member Author

yhamoudi commented Feb 7, 2015

Barack Obama is not an instance of president: https://www.wikidata.org/wiki/Q76

I don't think it's our problem.

@yhamoudi
Member Author

yhamoudi commented Feb 7, 2015

"What language is spoken in Argentina?" is not handled correctly (it was before 1c93ec6 ).

The reason for this is that Argentina is linked to a verb (spoken) by a prep relation. In this case, we use rule R3: List movies directed by Spielberg -> (?, instance of, movie)∩(?, director, Spielberg)

However, I haven't found enough examples of this case (a prep relation linked to a verb).

There are 3 possibilities:

  • we find more examples of prep+verb, and then deduce the most relevant rule
  • we always use rule R5 for prep (here it's prep_in), except for prep_by (use R3)
  • we consider that rule R3 is the right one. In your example, we can build the form (?, instance of, language)∩(?, spoken in, Argentina) (so it's more a problem of nounification). It is the most elegant solution.

@Ezibenroc
Member

we always use rule R5 for prep (here it's prep_in), except for prep_by (use R3)

Seems better. More examples:

  • prep_in: In which countries is the Lake Victoria? > (?, instance of, country)∩(Lake Victoria, country, ?) (right now we return (country, identity, ?)∩(Lake Victoria, identity, ?) for this one).
  • prep_from: From which country is Alan Turing? > (?, instance of, country)∩(Alan Turing, country, ?) (right now we return (country, identity, ?)∩(Alan Turing, identity, ?) for this one).
  • prep_on: What kings ruled on France? > (?, instance of, king)∩(France, ruler, ?) but also (?, instance of, king)∩(?, ruled on, France).

@yhamoudi
Member Author

yhamoudi commented Feb 7, 2015

Your first 2 examples are interesting, but they're not the problem here (I think I will generalize to all dependencies the current algorithm that produces instance of. Then, if we add in which and from which to our question words, we will obtain the right results). The problem is not about how to handle instance of (even if the following examples include instance of).

The 3rd example corresponds to our topic. But as you can see there are 2 possible triples.

New possibility to handle a -prep_x-> b:

  • if a is a noun, produce (b,a,?) (still what we do)
  • if a is a verb (except be), produce (?, a x, b). We keep the past participle of the verb a (no nounification is applied to it) and we add the preposition x. Naturally, in the final results we also add the nounified nouns to a x.

Ex:

  • What kings ruled on France? > (?, instance of, king)∩(?, [ruled on, ruler, ...], France)
  • What language is spoken in Argentina? > (?, instance of, language)∩(?, [spoken in, language, spokesman, ...], Argentina)

This new solution is interesting because it enables us to use the information contained in the preposition. Until now, we didn't use it (all the prepositions were removed) and we sometimes lost parts of multi-word verbs.
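
For concreteness, a rough sketch of this handling of a -prep_x-> b; the function and the nounify helper are hypothetical, only meant to illustrate the rule:

    def handle_prep(a, a_pos, x, b, nounify):
        # a: governor word, a_pos: its part of speech, x: the preposition,
        # b: dependent word, nounify: helper returning candidate nouns for a verb.
        if a_pos == 'NOUN':
            return (b, [a], '?')            # "capital of India" -> (India, capital, ?)
        if a_pos == 'VERB' and a != 'be':
            predicates = [a + ' ' + x] + nounify(a)
            return ('?', predicates, b)     # keep "ruled on", then add "ruler", ...

    # "What kings ruled on France?" -> ('?', ['ruled on', 'ruler'], 'France')
    print(handle_prep('ruled', 'VERB', 'on', 'France', lambda v: ['ruler']))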

@yhamoudi
Member Author

yhamoudi commented Feb 8, 2015

@ everyone: please give your opinion, I think it's an important question...


We must choose the most efficient one for database queries. Do not forget the final goal of our tool.

Is it so expensive to handle triples with a missing subject? I can't imagine that the algorithms need to look at all the entities of the database to find which ones have the right predicate and object.


Let's imagine you want to parse a text to fill in a database. You have to process the sentence The animal lives in the farm. You can associate 2 different full triples with this sentence:

  • (animal,residence,farm)
  • (farm,inhabitant,animal)

If you think that we need to add only the first triple to the database, then we must use the rule "object missing" for nsubj(pass) (Where does the animal live?) and the rule "subject missing" for dobj/prep (Who lives in the farm?). If you use "object missing" for dobj/prep, you cannot answer "Who lives in the farm?"

If you add only the second triple to the database, it's the same thing: you cannot answer "Where does the animal live?" unless you use "missing subject" for nsubj(pass).

Now, if you add the 2 full triples into the database, you can produce for each question 2 different successful queries:

  • Where does the animal live? -> (animal,residence,?) or (?,inhabitant,animal)
  • Who lives in the farm? -> (?,residence,farm) or (farm,inhabitant,?)

In my opinion, a database should be built with the last technique. If the animal lives in the farm, your database should contain 2 pieces of information:

  • animal + property residence -> farm
  • farm + property inhabitant -> animal

The problem is that Wikidata does not always contain these kinds of "reversed triples" ((James and the Giant Peach, author, ?) exists but there is nothing about James and the Giant Peach on the Roald Dahl page). This is not our problem, it is Wikidata's problem!

Potential agreement

For each verb (e.g. live in), we make a distinction (= 2 different nounification maps) between the 2 kinds of nouns (= predicates) that can be obtained from it:

  • the nouns N1 that enable you to transform X lives in Y into (X, N1, Y) (ex for lives in: residence, location, ...)
  • the nouns N2 that enable you to transform X lives in Y into (Y, N2, X) (ex for lives in: inhabitant,...)

Then, you change the datamodel. A triple with hole becomes [X,N1,N2], which returns the set of values V such that (X,N1,V) is a correct full triple or (V,N2,X) is a correct full triple.

Examples:

  • Where does the animal live? -> [animal,residence,inhabitant]
  • Who lives in the farm? -> [farm,inhabitant,residence]

If you ask Where does the animal live? and your database contains (farm,inhabitant,animal) but not (animal,residence,farm), no problem -> we still find the answer thanks to the redundancy; we are more robust.
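
In code terms, the two nounification maps and the resulting triple with hole could look roughly like this (toy vocabulary, invented for the example):

    # Two nounification maps for the verb "live in".
    N1_MAP = {'live in': ['residence', 'location']}   # X lives in Y -> (X, N1, Y)
    N2_MAP = {'live in': ['inhabitant']}              # X lives in Y -> (Y, N2, X)

    def triple_with_hole(entity, verb):
        # Build [X, N1, N2] as described in the "Potential agreement" above.
        return [entity, N1_MAP[verb], N2_MAP[verb]]

    # Where does the animal live?
    print(triple_with_hole('animal', 'live in'))
    # ['animal', ['residence', 'location'], ['inhabitant']]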

@progval
Member

progval commented Feb 9, 2015

How could we choose:

  • choose the more natural way to do it. I think it's the symmetry: missing object for nsubj/nsubjpass and missing subject for prep/dobj.
  • choose the most efficient one for database queries (@Tpt?): it seems that triples with a missing object are easier to solve, and triples like (James and the Giant Peach, author, ?) are more likely to exist than (?, works, James and the Giant Peach)

I think the first item is better. Missing subjects will still be rare, and I think it makes sense to still have missing subjects in that case because databases are less likely to have that kind of mapping anyway (e.g. there are far more inhabitants of a country than there are countries, so it's more relevant to have a person -> countries mapping than a country -> person one).

@Tpt
Member

Tpt commented Feb 9, 2015

Efficiency:

Most knowledge bases are structured as directed graphs: the triple (a, b, c) is in the knowledge base if, and only if, its graph contains the labeled edge a ->^b c. So resolving (a, b, ?) is easy, but resolving (?, b, c) is harder (we need to get the edges incident to c).

I can't imagine that the algorithms need to look at all the entities of the database to find which ones have the right predicate and object.

Such databases are of course stored with fast indexes for such queries. But we are talking about databases with around one hundred million triples.

In my opinion, a database should be built with the last technique.

The big issue is that it creates a lot of redundancy: when the triple (a, b, c) is changed, you also have to change the triple (c, reverse(b), a). So it won't be done in databases like Wikidata. But we can create, on top of Wikidata, a reasoning engine that adds triples entailed by the Wikidata ones to our Wikidata clone.


Triples:

I agree with @yhamoudi: outputting the two triples is, I think, the right solution because it really eases the job of the modules. In a lot of cases some databases use (a, b, c) and others (c, reverse(b), a), so picking one technique is, I think, not a viable option.

But, currently, the Wikidata module is very bad at handling missing-subject triples (it has to make a request to Wikidata Query, which is very slow). But I should implement missing-subject triple lookup in our EntityStore in the future.

But should we change the data model for that? Your [X, N1, N2] may, I think, easily be rewritten as (X, N1, ?) ∪ (?, N2, X). It won't expand the size of the query trees much, and it avoids the creation of yet another complicated operator.

@Ezibenroc
Member

I like the "Potential agreement" from @yhamoudi and I think that we should code it with (X, N1, ?) ∪ (?, N2, X), as mentioned by @Tpt.
As you said, it will give us a more robust output.
The drawback is that we will have more missing subjects than before: the back-end modules must be able to handle them efficiently.

@yhamoudi
Member Author

yhamoudi commented Feb 9, 2015

avoids the creation of yet another complicated operator.

It will replace the 2 current kinds of triple with hole (each one can be represented by setting N1 or N2 to []).

I find it a bit strange to duplicate each triple (our trees will no longer be readable :(). With [X,N1,N2] the algorithm is quite clear for database-querying tools:

    If one of the predicates N of N1 exists for X in the database:
        Return the set of Y s.t. (X,N,Y)
    Else if one of the predicates N of N2 points to X in the database:
        Return the set of Y s.t. (Y,N,X)

If we split the triples, some modules will try to solve triples (?,N,X) even though (X,reverse(N),?) has already been solved elsewhere.
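
For what it's worth, a runnable version of the pseudocode above, assuming the database is just a set of (subject, predicate, object) tuples (the real modules obviously query their own stores):

    def resolve(X, N1, N2, db):
        for N in N1:
            found = {o for (s, p, o) in db if s == X and p == N}
            if found:                 # a predicate of N1 exists for X
                return found
        for N in N2:
            found = {s for (s, p, o) in db if p == N and o == X}
            if found:                 # a predicate of N2 points to X
                return found
        return set()

    db = {('farm', 'inhabitant', 'animal')}
    print(resolve('animal', ['residence'], ['inhabitant'], db))   # {'farm'}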

@Tpt
Member

Tpt commented Feb 9, 2015

It will replace the 2 current kinds of triple with hole (each one can be represented by setting N1 or N2 to []).

Good point. I haven't thought about it.

I don't like your semantics of [X, N1, N2] because, as we have (a, b, c) = true <-> (c, reverse(b), a) = true, it will make modules not return all possible results if the graph-stored database is not complete (e.g. not all possible statements are there).

If we split the triples, some modules will try to solve triples (?,N,X) even though (X,reverse(N),?) has already been solved elsewhere.

It will happen with your definition too. Module A may go into the "if" clause of your algorithm and module B into the "else" clause.

So, my current opinion is to replace the current triples with hole with [X, B1, B2], which has the semantics: [X, B1, B2] = { c | ∃ a ∈ X, ∃ b ∈ B1, (a, b, c) } ∪ { a | ∃ c ∈ X, ∃ b ∈ B2, (a, b, c) }.

@yhamoudi
Member Author

yhamoudi commented Feb 9, 2015

So, my current opinion is to replace the current triples with hole with [X, B1, B2], which has the semantics: [X, B1, B2] = { c | ∃ a ∈ X, ∃ b ∈ B1, (a, b, c) } ∪ { a | ∃ c ∈ X, ∃ b ∈ B2, (a, b, c) }.

OK, if we clearly state that these lists of predicates are different attempts to nounify the same word (and then each module developer decides whether they want to go through the whole lists or just stop at the first predicate that exists in their database).

Full triples: [X, B1, B2, Y] = true iff ∀ x ∈ X, ∀ y ∈ Y, (∃ b ∈ B1, (x,b,y)) ∨ (∃ b ∈ B2, (y,b,x))?

(Triple with hole: [X,B1,B2,?] ?)

@yhamoudi
Member Author

yhamoudi commented Feb 9, 2015

Other opinions?

@progval said that it could be more difficult for external developers to understand this structure. I think that we can first introduce what a triple (a,b,c) is and then define [X, B1, B2] (or [X, B1, B2, ?]) and [X, B1, B2, Y] in terms of it:

  • [X, B1, B2] = { c | ∃ a ∈ X, ∃ b ∈ B1, (a, b, c) } ∪ { a | ∃ c ∈ X, ∃ b ∈ B2, (a, b, c) }
  • [X, B1, B2, Y] = ∀ x ∈ X, ∀ y ∈ Y, (∃ b ∈ B1, (x,b,y)) ∨ (∃ b ∈ B2, (y,b,x))

Full triples (a,b,c) and triples with holes (a,b,?) will no longer be available in the datamodel, but we use (a,b,c) to explain what [X, B1, B2] and [X, B1, B2, Y] are. With some examples, it's quite clear.

If we also explain why we chose this formalism (predicates/reverse predicates/multiple predicates/...), I think it will not be so difficult to understand.

On the other hand, if we use (X, N1, ?) ∪ (?, N2, X), I think it would be more difficult for external developers. First, when they see a tree they won't be able to understand it quickly (since many parts of the tree will be duplicated). Then, let's imagine they know that their database is "consistent" (when (a,b,c) occurs there is also (c,reverse(b),a)). They don't need to look at B2 in [X, B1, B2] (if B1 != []). They have 2 alternatives if we use (X, N1, ?) ∪ (?, N2, X):

  • on a union node such as (X, N1, ?) ∪ (?, N2, X), process (X, N1, ?) only -> bad idea, we could need this kind of configuration for something other than [X,N1,N2]
  • process all the triples -> they will lose time.

We should keep ∪ for "real" unions, not for [X,N1,N2].

@Ezibenroc
Member

OK, I am convinced.

You should propose a pull request for the documentation.

As changing the triples in the implementation will certainly be tedious work, I suggest that we merge this pull request (which is already huge) and do a new one when the implementation of the new datamodel is ready.

@yhamoudi yhamoudi mentioned this pull request Feb 10, 2015
@Tpt
Member

Tpt commented Feb 10, 2015

As changing the triples in the implementation will certainly be tedious work, I suggest that we merge this pull request (which is already huge) and do a new one when the implementation of the new datamodel is ready.

Strong +1


A new proposal that would achieve the same goal with, I think, a less disruptive change:

Change in abstract data model.

We introduce the notion of property: a property is a resource that may be used as a predicate. Example: birth date is a property but Douglas Adams is not.

We introduce the reverse operator, of type property → property, such that reverse(p) is a reverse property of p, i.e. a property such that ∀ a, c, (a,p,c) ↔ (c, reverse(p), a). We generalize this operator to list → list (it returns a reverse property for each property in the input list).

That's it. If we use the definitions of [X, B1, B2] and [X, B1, B2, Y] used by Yassine, we have [X, B1, B2] = (X, B1 ∪ reverse(B2), ?) and [X, B1, B2, Y] = (X, B1 ∪ reverse(B2), Y).
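
A tiny sketch of that rewrite, with a hypothetical reverse-property table (in practice the reverse operator would live in the datamodel libraries):

    REVERSE = {'author': 'works', 'works': 'author',
               'inhabitant': 'residence', 'residence': 'inhabitant'}

    def reverse(properties):
        # Generalization of reverse to lists of properties.
        return [REVERSE[p] for p in properties if p in REVERSE]

    def to_triple_with_hole(X, B1, B2):
        # [X, B1, B2]  ->  (X, B1 ∪ reverse(B2), ?)
        predicates = list(dict.fromkeys(list(B1) + reverse(B2)))
        return (X, predicates, '?')

    print(to_triple_with_hole('animal', ['residence'], ['inhabitant']))
    # ('animal', ['residence'], '?')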

Change in serialization

We just add to triples (i.e. both full triples and triples with hole) an optional parameter "reverse-predicate" that contains the list of reverse predicates, if we want some.

The serialization of [X, B1, B2] = (X, B1 ∪ reverse(B2), ?) is:

{
    "type": "triple",
    "subject": serialization(X),
    "predicate": serialization(B1),
    "reverse-predicate": serialization(B2),
    "object": {"type": "missing"}
}

Remark: with this serialization we support all the reasonable use cases of reverse: this operator only has an impact on triples.


Pro:

  • No breaking changes, we just add a new parameter to triples
  • Easy to understand: NLP algorithms output triples as before but, if they think they need to, they add reverse predicates. Complicated modules (like Wikidata) may want to use them, and the other ones don't have to care about them, as they only support a very limited set of questions and so know more about them. Example: OEIS module may use the speci
  • Beginners may just ignore the reverse-predicate parameter.

Cons:

  • The work of the NLP algorithms is a little bit more difficult: they still have to choose where to put the missing operator, but they are now allowed to fail, since the work of complicated modules is helped by reverse predicates.
  • The discussions about where the missing operator should be are still open.
  • It introduces two equivalent serializations for the same underlying operator: (X, B1 ∪ reverse(B2), ?) = (?, B2 ∪ reverse(B1), X). But data model libs may do some normalization (as is done for [a] = a).

@Ezibenroc
Member

A new proposal that would achieve the same goal with, I think, a less disruptive change:

I like it. I find it easier to understand. And it will be much easier to adopt.

The discussions about where the missing operator should be are still open.

Yes, especially if we say to beginners that they can ignore the reverse predicates.

But data model libs may do some normalization

I don't see how this is possible: the datamodel libs do not know how to reverse a predicate; only question-parsing modules do.

@Tpt
Member

Tpt commented Feb 10, 2015

But data model libs may do some normalization

I was thinking about very simple things, like replacing the resource a with the list [a] in order to have only lists, or replacing (?, B2 ∪ reverse(B1), X) with (X, B1 ∪ reverse(B2), ?).

@yhamoudi
Member Author

The work of the NLP algorithms is a little bit more difficult

More difficult than what? It would be the same difficulty for us as with [X, B1, B2].

Beginners may just ignore the reverse-predicate parameter.

reverse-predicate shouldn't be seen as a parameter that can be ignored, since there is no reason for predicates to be better than reverse predicates.

For instance, we will represent What did Roald Dahl write by (Roald Dahl, pred=works, rev_pred=author, ?). Databases are more likely to contain (James and the Giant Peach, author, Roald Dahl) than (Roald Dahl, works, James and the Giant Peach). If developers don't take reverse predicates into account, they won't be able to find the answer (and most developers will probably ignore optional parameters...).

I think we should give the same importance to predicates and reverse predicates, in order not to encourage people to use only predicates.

Except for the "easy" modules (like OEIS), all modules that expect to handle questions with verbs other than "be" need reverse predicates (since most verbs can be nounified in 2 different ways, e.g. residence <- live -> inhabitant).
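
To illustrate the point with a toy database (the works/author property names are just the ones from the example above):

    # Serialization of (Roald Dahl, pred=works, rev_pred=author, ?) as a
    # Python dict mirroring the proposed JSON.
    question = {
        "type": "triple",
        "subject": "Roald Dahl",
        "predicate": ["works"],
        "reverse-predicate": ["author"],
        "object": {"type": "missing"},
    }

    db = {("James and the Giant Peach", "author", "Roald Dahl")}

    # Using only the direct predicates finds nothing...
    direct = {o for (s, p, o) in db
              if s == question["subject"] and p in question["predicate"]}
    # ...while the reverse predicates recover the answer.
    rev = {s for (s, p, o) in db
           if o == question["subject"] and p in question["reverse-predicate"]}
    print(direct, rev)   # set() {'James and the Giant Peach'}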

The discussions about where the missing operator should be are still open.

And there is no solution to this. The choice between (X, pred=B1, rev_pred=B2, ?) and (?, pred=B2, rev_pred=B1, X) will remain arbitrary.

@Ezibenroc
Member

Let's discuss it here: ProjetPP/Documentation#52

@yhamoudi: I am ok for the pull request, you can (and should) merge it.

@yhamoudi
Member Author

@yhamoudi: I am ok for the pull request, you can (and should) merge it.

OK, I just have some things to check and then I'll merge (perhaps tomorrow).

@yhamoudi
Member Author

Say bye to the pull request 👋

yhamoudi added a commit that referenced this pull request Feb 11, 2015
Improve nsub and nn dependencies analysis
@yhamoudi yhamoudi merged commit 486ff7b into master Feb 11, 2015
@progval
Member

progval commented Feb 12, 2015

Should I make a new release?

@yhamoudi
Member Author

Yes, you can.
