Dirichlet process gone bad: stick is broken in wrong place #33

Open
rafis opened this issue Mar 10, 2018 · 3 comments
Comments

@rafis

rafis commented Mar 10, 2018

I trained a model on the text8 corpus with the following config. (Please note that this example sometimes works and gives accurate results with other configs.)

./run.sh train.jl --epochs 5 --alpha 0.05 --prototypes 10 --min-freq 20 --remove-top-k 70 --window 5 text8 text8.dic text8.model
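
To get a feel for what `--alpha 0.05` and `--prototypes 10` imply before any data is seen, here is a minimal sketch of the expected weights under a standard truncated stick-breaking prior. It assumes each stick fraction is drawn from Beta(1, alpha); the function name `prior_expected_weights` is made up for this illustration and is not part of AdaGram.

```julia
# Expected stick-breaking weights under a truncated Dirichlet process prior,
# before any data is seen.
# Assumption: each stick fraction is Beta(1, alpha), so its mean is 1 / (1 + alpha).
# The fractions are independent, so the expected weights multiply out directly.
function prior_expected_weights(alpha::Float64, T::Int)
    weights = zeros(T)
    remaining = 1.0                      # length of stick not yet assigned
    for k in 1:T-1
        weights[k] = remaining / (1 + alpha)
        remaining -= weights[k]
    end
    weights[T] = remaining               # last component takes the leftover
    return weights
end

w = prior_expected_weights(0.05, 10)     # alpha and prototypes from the command above
println(w)
```

With a small alpha almost all prior mass sits on the first sense, and each further unused slot shrinks by a factor of alpha / (1 + alpha), about 0.048 here. That is roughly the ratio between consecutive trailing entries in the `expected_pi` output below, which suggests those slots are essentially untouched by the data.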

When I check the word "apple", I first look at the number of senses (meanings):

julia> expected_pi(vm, dict.word2id["apple"])
10-element Array{Float64,1}:
 0.197259
 0.216447
 0.58626
 3.24536e-5
 1.54719e-6
 7.37607e-8
 3.51647e-9
 1.67644e-10
 7.99224e-12
 4.00096e-13
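
The number of senses in use can be read off this output by thresholding the probabilities. A small sketch of that check, using the values printed above; the cutoff of 1e-3 is an assumption chosen for illustration, and the variable names are hypothetical.

```julia
# Values copied from the expected_pi output above.
pi_apple = [0.197259, 0.216447, 0.58626, 3.24536e-5, 1.54719e-6,
            7.37607e-8, 3.51647e-9, 1.67644e-10, 7.99224e-12, 4.00096e-13]

# Assumption: a sense counts as "used" when its expected probability
# exceeds this cutoff. 1e-3 is an illustrative choice.
min_prob = 1e-3

used = findall(p -> p > min_prob, pi_apple)
println("senses in use: ", used)
println("free slots: ", length(pi_apple) - length(used))
```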

We have 3 senses and 7 free slots, which is nothing unusual. Then I ask for the nearest neighbors of each sense:

julia> nearest_neighbors(vm, dict, "apple", 1, 10)
10-element Array{Tuple{AbstractString,Int64,Float32},1}:
 ("macintosh",2,0.6276491f0)
 ("intel",2,0.5980226f0)
 ("ibm",2,0.59220535f0)
 ("compaq",1,0.5730073f0)
 ("inc",2,0.572671f0)
 ("store",2,0.56161773f0)
 ("raskin",1,0.56127656f0)
 ("corp",1,0.55665475f0)
 ("ceo",1,0.54154074f0)
 ("ceo",2,0.54141444f0)

julia> nearest_neighbors(vm, dict, "apple", 2, 10)
10-element Array{Tuple{AbstractString,Int64,Float32},1}:
 ("apples",1,0.76360685f0)
 ("sweet",1,0.70247304f0)
 ("juice",1,0.6916403f0)
 ("cakes",1,0.6847711f0)
 ("fermented",1,0.681853f0)
 ("olive",1,0.6792287f0)
 ("fruit",1,0.6718393f0)
 ("peas",1,0.6700381f0)
 ("berries",1,0.66832954f0)
 ("roasted",1,0.66814494f0)

julia> nearest_neighbors(vm, dict, "apple", 3, 10)
10-element Array{Tuple{AbstractString,Int64,Float32},1}:
 ("macintosh",1,0.9284175f0)
 ("computers",1,0.8870821f0)
 ("pc",1,0.88180965f0)
 ("compatible",1,0.8577318f0)
 ("amiga",1,0.83944887f0)
 ("ibm",1,0.8265453f0)
 ("desktop",1,0.8234609f0)
 ("portable",1,0.81334895f0)
 ("pcs",1,0.8022719f0)
 ("dos",1,0.8022494f0)

As you can see, the first and the third senses are actually the same. Why did AdaGram break it into 2 different senses?

@rversteegen

Those are two quite different senses, aren't they? Apple Inc (the company) vs Apple computers (the product). (Although 'ibm' appears in the nearest neighbour list for both senses, I think those also differ by being related to IBM the company and IBM PCs)

When this "worked" for you, what senses did you get?

@rversteegen

Oh, and I see two different senses of 'macintosh' also appear in the nearest neighbour lists. The model seems to have mistakenly split macintosh into two senses (in addition to Macintosh apples).
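
The point about 'ibm' and 'macintosh' can be checked directly against the lists in the original post. The sketch below compares the neighbours of senses 1 and 3 of "apple" twice: once by word only, and once by (word, sense) pair. The data is copied from the output above; the variable names are hypothetical.

```julia
# Neighbour lists copied from the nearest_neighbors output for senses 1 and 3 of "apple".
sense1 = [("macintosh", 2), ("intel", 2), ("ibm", 2), ("compaq", 1), ("inc", 2),
          ("store", 2), ("raskin", 1), ("corp", 1), ("ceo", 1), ("ceo", 2)]
sense3 = [("macintosh", 1), ("computers", 1), ("pc", 1), ("compatible", 1), ("amiga", 1),
          ("ibm", 1), ("desktop", 1), ("portable", 1), ("pcs", 1), ("dos", 1)]

# Overlap by surface word only.
shared_words = intersect(first.(sense1), first.(sense3))

# Overlap when the neighbour's sense index is taken into account.
shared_pairs = intersect(sense1, sense3)

println("shared words: ", shared_words)
println("shared (word, sense) pairs: ", shared_pairs)
```

The two lists share the words "macintosh" and "ibm", but always with different sense indices, so no (word, sense) pair is common to both.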

@glicerico

I have seen this behavior before as well, and was wondering if my corpus is not large enough or something else is wrong. Actually, sometimes I find that two senses of a word are near enough that they appear in each other's nearest neighbors list.
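
One way to quantify "near enough" is the cosine similarity between the two sense vectors. A minimal sketch follows; the vectors are made up for illustration, and how to extract real sense vectors from a trained model is not shown here.

```julia
using LinearAlgebra

# Cosine similarity between two embedding vectors.
cosine(u, v) = dot(u, v) / (norm(u) * norm(v))

# Hypothetical 4-dimensional sense vectors, made up for illustration only.
sense_a = [0.9, 0.1, 0.3, 0.0]
sense_b = [0.8, 0.2, 0.4, 0.1]    # points in nearly the same direction as sense_a
sense_c = [0.0, 0.9, -0.2, 0.5]   # points in a clearly different direction

println("a vs b: ", cosine(sense_a, sense_b))
println("a vs c: ", cosine(sense_a, sense_c))
```

A value close to 1 for two senses of the same word would hint that they are near-duplicates rather than distinct meanings.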
