Dirichlet process gone bad: stick is broken in wrong place #33

rafis · 2018-03-10T17:48:14Z

I trained a model on the text8 corpus with the following config. (Please note that this example sometimes works and gives accurate results with other configs.)

./run.sh train.jl --epochs 5 --alpha 0.05 --prototypes 10 --min-freq 20 --remove-top-k 70 --window 5 text8 text8.dic text8.model

When I check the word "apple", I first look at the number of senses (meanings):

julia> expected_pi(vm, dict.word2id["apple"])
10-element Array{Float64,1}:
 0.197259
 0.216447
 0.58626
 3.24536e-5
 1.54719e-6
 7.37607e-8
 3.51647e-9
 1.67644e-10
 7.99224e-12
 4.00096e-13
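As a rough check, the number of senses in use can be read off this vector by thresholding. This is a minimal sketch, assuming a hypothetical cutoff `min_prob = 1e-3` (chosen for illustration, not a value taken from AdaGram) and Julia 0.7 or later for `findall`:

```julia
# Sense probabilities for "apple", copied from the expected_pi output above.
probs = [0.197259, 0.216447, 0.58626, 3.24536e-5, 1.54719e-6,
         7.37607e-8, 3.51647e-9, 1.67644e-10, 7.99224e-12, 4.00096e-13]

# Hypothetical cutoff: senses below this are treated as unused slots.
min_prob = 1e-3

# Indices of the senses whose probability exceeds the cutoff.
active = findall(p -> p > min_prob, probs)

println("active senses: ", active)
println("free slots: ", length(probs) - length(active))
```

With these numbers the first three slots pass the cutoff and the remaining seven do not.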

We have 3 senses and 7 free slots, which is nothing unusual. Then I ask it to describe each sense:

julia> nearest_neighbors(vm, dict, "apple", 1, 10)
10-element Array{Tuple{AbstractString,Int64,Float32},1}:
 ("macintosh",2,0.6276491f0)
 ("intel",2,0.5980226f0)
 ("ibm",2,0.59220535f0)
 ("compaq",1,0.5730073f0)
 ("inc",2,0.572671f0)
 ("store",2,0.56161773f0)
 ("raskin",1,0.56127656f0)
 ("corp",1,0.55665475f0)
 ("ceo",1,0.54154074f0)
 ("ceo",2,0.54141444f0)

julia> nearest_neighbors(vm, dict, "apple", 2, 10)
10-element Array{Tuple{AbstractString,Int64,Float32},1}:
 ("apples",1,0.76360685f0)
 ("sweet",1,0.70247304f0)
 ("juice",1,0.6916403f0)
 ("cakes",1,0.6847711f0)
 ("fermented",1,0.681853f0)
 ("olive",1,0.6792287f0)
 ("fruit",1,0.6718393f0)
 ("peas",1,0.6700381f0)
 ("berries",1,0.66832954f0)
 ("roasted",1,0.66814494f0)

julia> nearest_neighbors(vm, dict, "apple", 3, 10)
10-element Array{Tuple{AbstractString,Int64,Float32},1}:
 ("macintosh",1,0.9284175f0)
 ("computers",1,0.8870821f0)
 ("pc",1,0.88180965f0)
 ("compatible",1,0.8577318f0)
 ("amiga",1,0.83944887f0)
 ("ibm",1,0.8265453f0)
 ("desktop",1,0.8234609f0)
 ("portable",1,0.81334895f0)
 ("pcs",1,0.8022719f0)
 ("dos",1,0.8022494f0)

As you can see, the first and the third senses are actually the same. Why did AdaGram split it into 2 different senses?

rversteegen · 2018-03-11T09:55:01Z

Those are two quite different senses, aren't they? Apple Inc (the company) vs Apple computers (the product). (Although 'ibm' appears in the nearest neighbour list for both senses, I think those also differ: one relates to IBM the company and the other to IBM PCs.)

When this "worked" for you, what senses did you get?

rversteegen · 2018-03-11T09:58:26Z

Oh, and I see two different senses of 'macintosh' also appear in the nearest neighbour lists. It seems to be mistakenly splitting macintosh into two senses (in addition to Macintosh apples).
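To make the overlap concrete, the two neighbour lists can be compared both by word and by (word, sense) pair. This is a minimal sketch using only the output posted above. The variable names are made up for illustration, and it assumes Julia 0.7 or later:

```julia
# (word, sense) pairs copied from the nearest_neighbors output for
# senses 1 and 3 of "apple"; similarity scores are dropped.
s1 = [("macintosh", 2), ("intel", 2), ("ibm", 2), ("compaq", 1), ("inc", 2),
      ("store", 2), ("raskin", 1), ("corp", 1), ("ceo", 1), ("ceo", 2)]
s3 = [("macintosh", 1), ("computers", 1), ("pc", 1), ("compatible", 1),
      ("amiga", 1), ("ibm", 1), ("desktop", 1), ("portable", 1),
      ("pcs", 1), ("dos", 1)]

# Words that occur in both lists, ignoring which sense of the neighbour matched.
shared_words = intersect(first.(s1), first.(s3))

# Neighbours that match on both the word and its sense index.
shared_pairs = intersect(s1, s3)

println("shared words: ", shared_words)
println("shared (word, sense) pairs: ", shared_pairs)
```

The two lists share the words "macintosh" and "ibm", but no (word, sense) pair: sense 1 is close to sense 2 of those words, while sense 3 is close to their sense 1.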

glicerico · 2018-05-25T04:48:19Z

I have seen this behavior before as well, and was wondering if my corpus is not large enough or something else is wrong. Actually, sometimes I find that two senses of a word are near enough that they appear in each other's nearest neighbors list.
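One way to quantify how near two senses are is the cosine similarity between their vectors. This is a minimal sketch with made-up toy vectors. How to pull the actual sense vectors out of a trained model is not shown here, and the `cosine` helper is a hypothetical name, not part of AdaGram:

```julia
using LinearAlgebra  # provides dot and norm (Julia 0.7 or later)

# Cosine similarity between two vectors of the same length.
cosine(a, b) = dot(a, b) / (norm(a) * norm(b))

# Toy stand-ins for two sense vectors of the same word (not real model output).
sense_a = [1.0, 0.0, 1.0]
sense_b = [1.0, 0.1, 0.9]

println("cosine similarity: ", cosine(sense_a, sense_b))
```

A value close to 1 would suggest the two senses are near-duplicates, while clearly lower values would suggest a real split. Where to draw that line is a judgment call.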
