Improve nsub and nn dependencies analysis #106
Hum, perhaps it was wrong. In the end, there will also be big changes to many other rules in this pull request |
For questions like "directed by", "written by" or "killed by", we should maybe have triples of the form (?, p, o), like (?, director, Spielberg) or (?, writer, Victor Hugo), because we usually state (Les Misérables, author, Victor Hugo), as in "The author of Les Misérables is Victor Hugo". Thanks a lot for these additions! |
Yes, all these things will be fixed in this pull request (today, I hope) |
Done! I will write down some details afterwards, but you can test and report any problem right now (even reverse triples) |
"What language is spoken in Argentina?" is not handled correctly (it was before 1c93ec6 ). |
Barack Obama is not an instance of |
i don't think it's our problem |
The reason for this is that However, I didn't find enough examples where it happens (a There are 3 possibilities:
|
Seems better. More examples:
|
Your first 2 examples are interesting, but they are not the problem here (I think I will generalize to all dependencies the current algorithm that produces The 3rd example corresponds to our topic. But as you can see, there are 2 possible triples. New possibility to handle
Ex:
This new solution is interesting because it enables us to use the information contained in the preposition. Until now, we didn't use it (all prepositions are removed), and we sometimes lose parts of multi-word verbs. |
@ everyone please give your opinion, i think it's an important question...
Is it so expensive to handle triples with a missing subject? I can't imagine that the algorithms need to look through all the entities of the database to find which ones have the right predicate and object. Let's imagine you want to parse a text to fill a database. You have to process the sentence
If you think that we need to add only the first triple in the database, then we must use the rule "object missing" for nsubj(pass) ( If you add only the second triple into the database, same thing, you cannot answer "Where does the animal live?" if you don't use "missing subject" for nsubj(pass). Now, if you add the 2 full triples into the database, you can produce for each question 2 different successful queries:
In my opinion, a database should be built with the last technique. If
The problem is that Wikidata does not always contain these kinds of "reversed triples" (
Potential agreement
For each verb (ex:
Then, you change the datamodel. A triple with hole is Examples:
If you ask |
I think the first item is better. Missing subjects will still be rare, and I think it makes sense to still have missing subjects in that case because databases are less likely to have that kind of mapping anyway (eg. there are far more inhabitants of a country than there are countries; so it's more relevant to have a |
Efficiency:
Most knowledge bases are structured as directed graphs: the triple
Such databases are of course stored with fast indexes for such queries. But we are talking about databases with around one hundred million triples.
The big issue is that it creates a lot of redundancy and when the triple
Triples:
I agree with @yhamoudi: outputting the two triples is, I think, the right solution because it really eases the job of modules. For a lot of cases some databases use But, currently, the Wikidata module is very bad at handling missing-subject triples (it has to send a request to Wikidata Query, which is very slow). But I should implement missing-subject triple lookup in our EntityStore in the future. But should we change the data model for that? Your |
I like the "Potential agreement" from @yhamoudi and I think that we should code it with |
It will replace the 2 current triples with hole (each one can be represented with N1 or N2=[]). I find it a bit strange to duplicate each triple (our trees will no longer be readable :( With
If we split the triples, some module will try to solve triples |
Good point. I hadn't thought about it. I don't like your semantics of
It will happen too with your definition. Module A may go into the "if" clause of your algorithm and module B into the "else" clause. So, my current opinion is to replace current triples with hole with |
OK, if we clearly specify that these lists of predicates are different attempts to nounify the same word (and then each module developer decides whether to go through the whole list or just stop at the first predicate that exists in their database). Full triples: (Triple with hole: |
Other opinions? @progval said that it could be more difficult for external developers to understand this structure. I think that we can first introduce what a triple is
Full triples If we also explain why we chose this formalism (predicates/reverse predicates/multiple predicates/...), I think it will not be so difficult to understand. On the other hand, if we use
We should keep |
OK, I am convinced. You should propose a pull request for the documentation. As changing the triples in the implementation will certainly be tedious work, I suggest that we merge this pull request (which is already huge) and that we do a new one when the implementation of the new datamodel is ready. |
Strong +1
A new proposal that would achieve the same goal with, I think, a less disruptive change:
Change in abstract data model
We introduce the notion of property: a property is a resource that may be used as a predicate. Example: birth date is a property but Douglas Adams is not. We introduce the That's it. If we use the definition of
Change in serialization
We just add to triples (i.e. to both full triples and triples with hole) an optional parameter "reverse-predicate" that contains the list of reverse predicates, if we want some. The serialization of
Remark: With this serialization, we support all the reasonable use cases of reverse: this operator only has an impact on triples.
Pro:
Cons:
|
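A minimal sketch of what the serialization change proposed above could look like, written in Python. Only the optional "reverse-predicate" parameter comes from the proposal; the other field names ("type", "subject", "predicate", "object", "value"), the helper functions, and the "written work" reverse predicate are assumptions made for illustration, not the actual datamodel.

```python
import json

# Hypothetical node for a hole in a triple (field names are assumptions).
MISSING = {"type": "missing"}


def resource(value):
    """Build a hypothetical resource node."""
    return {"type": "resource", "value": value}


def triple(subject, predicate, obj, reverse_predicates=None):
    """Build a triple; "reverse-predicate" is only emitted when provided."""
    node = {
        "type": "triple",
        "subject": subject,
        "predicate": predicate,
        "object": obj,
    }
    if reverse_predicates:
        node["reverse-predicate"] = list(reverse_predicates)
    return node


def candidate_queries(node):
    """List the (subject, predicate, object) queries a module could try.

    The direct reading comes first; each reverse predicate adds one query
    with subject and object swapped. Modules that do not care about
    reverse predicates can simply use the first entry.
    """
    queries = [(node["subject"], node["predicate"], node["object"])]
    for reverse in node.get("reverse-predicate", []):
        queries.append((node["object"], reverse, node["subject"]))
    return queries


# (?, writer, Victor Hugo), with a hypothetical reverse predicate.
example = triple(
    MISSING,
    resource("writer"),
    resource("Victor Hugo"),
    reverse_predicates=[resource("written work")],
)
print(json.dumps(example, indent=2))
```

A triple built without reverse predicates serializes exactly as before, which matches the point that beginners can ignore the new parameter.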
I like it. I find it easier to understand. And it will be much easier to adopt.
Yes, especially if we say to beginners that they can ignore the reverse predicates.
I don't see how it is possible: the datamodel libs do not know how to reverse a predicate, only question parsing modules do. |
I was thinking about very simple things like replacing the resource |
More difficult than what? It would be the same difficulty for us as with
For instance, we will represent I think we should give the same importance to predicates and reverse predicates, in order not to encourage people to use only predicates. Except for the "easiest" modules (like OEIS), all modules that expect to handle questions with verbs other than "be" need reverse predicates (since most verbs can be nounified in 2 different ways. Ex:
And there is no solution to this. The choice between |
Let's discuss it here: ProjetPP/Documentation#52
@yhamoudi: I am OK with the pull request; you can (and should) merge it. |
OK, I just have some things to check and then I'll merge (perhaps tomorrow) |
Say bye to the pull request 👋 |
Should I make a new release? |
yes you can |
Many important modifications (in the end, the algorithm has been improved on more than just nsubj relations :)
nsubj relations and instance_of
I have observed that there were 2 types of nsubj relations:
instance_of triples. Examples:
(?, instance of, language)∩(Argentina, language, ?)
(?, instance of, movie)∩(?, director, Spielberg)
(?, instance of, book)∩(?, writer, Victor Hugo)
(?, instance of, president)∩(?, killer, Oswald)
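A minimal sketch of how such an intersection of triples with holes could be evaluated, assuming a toy in-memory triple store. The TRIPLES data and the solve function are hypothetical and are not part of the project's code; real modules query a knowledge base instead.

```python
# Toy knowledge base of full triples (subject, predicate, object).
# The facts are illustrative only.
TRIPLES = {
    ("Spanish", "instance of", "language"),
    ("French", "instance of", "language"),
    ("Argentina", "language", "Spanish"),
    ("Argentina", "capital", "Buenos Aires"),
}


def solve(subject, predicate, obj):
    """Return the values that fill the hole of a triple.

    Exactly one of subject / obj is expected to be None (the hole).
    """
    results = set()
    for s, p, o in TRIPLES:
        if p != predicate:
            continue
        if subject is None and o == obj:
            results.add(s)  # missing subject: collect matching subjects
        elif obj is None and s == subject:
            results.add(o)  # missing object: collect matching objects
    return results


# (?, instance of, language) ∩ (Argentina, language, ?)
answer = solve(None, "instance of", "language") & solve("Argentina", "language", None)
print(answer)
```

The instance_of triple acts as a type filter on the answers of the second triple.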
In order to deal with the first type of nsubj(pass) relations, I needed to add 2 new grammatical dependencies: nsubj_qw and nsubjpass_qw. These rules replace nsubj and nsubjpass when it is relevant (see updateNsubjRule in questionWordProcessing.py). I also added a rule R6 to handle instance_of triples.
nn/amod rules and merging
I've added a heuristic to improve multiword expression recognition. Previously, we were merging all nn relations. However, it is not relevant in some cases:
United States and president are merged
Now the merging is not performed between a -nn-> b if a and b have different namedEntityTag (for instance Location and undef).
We need to perform more tests to be sure it's relevant. I've still found a problem: the named entity tagging is sometimes bad. Ex:
S. is not tagged PERSON
F. is not tagged PERSON
We should at least correct these tags: if a word v is between 2 words u and w that have the same tag, and v is linked to u or w by an nn relation, then add the tag of u and w to v.
TODO before accepting the pull request
I propose to accept the pull request once these tasks are done (properly):
S. and F.). See Develop some heuristics to correct the dependency tree #107
nn dependencies
prep dependencies
nsubj dependencies
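The nn merging heuristic and the tag-correction rule described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the list-of-tags representation, the use of adjacent words for "between", and the guard that skips untagged neighbours are all hypothetical and do not reflect the actual implementation.

```python
def should_merge_nn(tag_a, tag_b):
    """Merge a -nn-> b only when both words carry the same named entity tag."""
    return tag_a == tag_b


def correct_tags(tags, nn_edges):
    """Propagate a tag to a word surrounded by two identically tagged words.

    tags: list of named entity tags, one per word, in sentence order.
    nn_edges: iterable of (i, j) word-index pairs linked by an nn relation.
    Assumption: "between" means the two directly adjacent words, and
    neighbours tagged "undef" are ignored so existing tags are not erased.
    """
    edges = {frozenset(edge) for edge in nn_edges}
    fixed = list(tags)
    for i in range(1, len(tags) - 1):
        left, right = tags[i - 1], tags[i + 1]
        linked = frozenset((i - 1, i)) in edges or frozenset((i, i + 1)) in edges
        if left == right and left != "undef" and tags[i] != left and linked:
            fixed[i] = left
    return fixed


# Hypothetical example: an initial such as "F." left untagged between
# two PERSON words, and linked to the last one by an nn relation.
words = ["John", "F.", "Kennedy"]
tags = ["PERSON", "undef", "PERSON"]
print(correct_tags(tags, [(2, 1), (2, 0)]))
print(should_merge_nn("Location", "undef"))
```

Correcting the tags first and applying the merge test afterwards would let the initial be merged with its neighbours, since all three words then share the same tag.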