bad argument #3 to '?' (lookupTable.lua:updateOutput) #8

Closed · lfuelling opened this issue Aug 18, 2016 · 38 comments

@lfuelling

Hi,

I'm trying to get neuralconvo to work with distro-cl. Unfortunately, when starting the training I get an error that I'm unable to solve and that seems to come from LookupTable.lua.

I already filed an issue in the neuralconvo project, but it seems like it's not going to be fixed there.

First things first, here is the stacktrace:

$ th train.lua --opencl --gpu 1
-- Loading dataset
Loading vocabulary from data/vocab.t7 ...

Dataset stats:
  Vocabulary size: 35147
         Examples: 221282
libthclnn_searchpath    /Users/lfuelling/torch-cl/install/lib/lua/5.1/libTHCLNN.so
Using Apple , OpenCL platform: Apple
Using OpenCL device: AMD Radeon R9 M370X Compute Engine

-- Epoch 1 / 50  (LR= 0.001)

/Users/lfuelling/torch-cl/install/bin/luajit: bad argument #3 to '?' (number expected, got nil)
stack traceback:
    [C]: at 0x0f429f20
    [C]: in function '__newindex'
    ...ling/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:108: in function 'updateOutput'
    ...uelling/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    train.lua:96: in function 'opfunc'
    .../lfuelling/torch-cl/install/share/lua/5.1/optim/adam.lua:33: in function 'adam'
    train.lua:131: in main chunk
    [C]: in function 'dofile'
    ...g/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x010f0e1cf0

The error occurs here (in the last line):

function nn.LookupTable:updateOutput(input)
   if torch.type(input) ~= 'torch.ClTensor' then
      return self:baseUpdateOutput(input)
   end

   assert(not self.shouldScaleGradByFreq, 'self.shouldScaleGradByFreq not implemented')

   if self.size == nil then
      self.size = torch.LongStorage(2)
      self.size[1] = self.nIndex

Do you have any idea or ideally a solution for my problem?

Regards

Lukas

@hughperkins
Owner

Well, step 1: I've installed neuralconvo, and can reproduce the problem :-)

@hughperkins hughperkins self-assigned this Aug 18, 2016
@hughperkins
Owner

Ok, so the root cause is: clnn's LookupTable relies on self.nIndex and self.nOutput being populated by the __init method, but torch nn's LookupTable doesn't do that.

The fix is that I need to somehow update clnn's LookupTable to be able to survive without these values being initialized, probably by looking at the dimensions of self.weight. Working on it...
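
A minimal sketch of the kind of guard I have in mind (hypothetical helper, not the actual patch; it assumes self.weight is the usual nIndex x nOutput matrix that nn.LookupTable stores):

   -- hypothetical sketch: recover the missing sizes from the weight matrix
   -- when __init never populated them (weight is nIndex x nOutput)
   local function ensureSizes(lookup)
      lookup.nIndex = lookup.nIndex or lookup.weight:size(1)
      lookup.nOutput = lookup.nOutput or lookup.weight:size(2)
   end

so clnn's updateOutput could call something like ensureSizes(self) before building self.size.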

hughperkins added a commit to hughperkins/clnn that referenced this issue Aug 18, 2016
hughperkins added a commit that referenced this issue Aug 18, 2016
@hughperkins
Owner

So, I think this problem is fixed. Which uncovers a new one :-P

rnn/MaskZero.lua:72: invalid arguments: ByteTensor ClTensor number 
expected arguments: [*ClTensor*] ClTensor float | [*ClTensor*] ClTensor ClTensor
stack traceback:
        [C]: in function 'eq'
        /data/git/torch-cl/install/share/lua/5.1/rnn/MaskZero.lua:72: in function 'updateOutput'
        /data/git/torch-cl/install/share/lua/5.1/rnn/LSTM.lua:162: in function 'updateOutput'
        /data/git/torch-cl/install/share/lua/5.1/rnn/Sequencer.lua:59: in function 'updateOutput'
        /data/git/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'

@hughperkins
Owner

Seems like MaskZero.lua is somewhat coupled with cunn:

   self.zeroMask = self.zeroMask or ((torch.type(rmi) == 'torch.CudaTensor') and torch.CudaByteTensor() or torch.ByteTensor())

I may need to either fork rnn, or submit a patch, or probably both.
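
Roughly the kind of generalization I'm thinking of (a sketch only, untested; the buffer name _zeroMaskCl is made up here, and whether cltorch can copy a ClTensor straight into a ByteTensor is an assumption):

   -- sketch, untested: cltorch's eq only writes into a ClTensor
   -- ([*ClTensor*] ClTensor float, per the error above), so for ClTensor
   -- inputs build the mask on the device, then move it to a host ByteTensor
   if torch.type(rmi) == 'torch.ClTensor' then
      self._zeroMaskCl = self._zeroMaskCl or torch.ClTensor()
      self._zeroMaskCl:eq(rmi, 0)
      self.zeroMask = self.zeroMask or torch.ByteTensor()
      self.zeroMask:resize(self._zeroMaskCl:size()):copy(self._zeroMaskCl)
   else
      self.zeroMask = self.zeroMask or
         ((torch.type(rmi) == 'torch.CudaTensor') and torch.CudaByteTensor() or torch.ByteTensor())
   end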

@lfuelling
Author

Thank you so much. Let me know if I can help somehow.

@hughperkins
Owner

Ok, so I've plausibly fixed MaskZero.lua, in draft. But now I see a new error (see below), so I'd better fix that one too :-P

/data/git/torch-cl/install/bin/luajit: /data/git/torch-cl/install/share/lua/5.1/nn/Select.lua:10: bad argument #3 to 'select' (out of range at /data/git/torch-cl/opencl/cltorch/src/lib/THClTensor.cpp:415)
stack traceback:
        [C]: in function 'select'
        /data/git/torch-cl/install/share/lua/5.1/nn/Select.lua:10: in function 'updateOutput'
        /data/git/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
        train.lua:95: in function 'opfunc'
        /data/git/torch-cl/install/share/lua/5.1/optim/adam.lua:33: in function 'adam'
        train.lua:131: in main chunk
        [C]: in function 'dofile'
        ...t/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406670
exit status 1

@hughperkins
Owner

For the Select issue, it seems like passing in -1 as a select index is now legal. I'll check this...

@hughperkins
Owner

Oh, Select.lua has been updated to handle negative indices:

   local dim = self.dimension < 0 and input:dim() + self.dimension + 1 or self.dimension
   local index = self.index < 0 and input:size(dim) + self.index + 1 or self.index
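
(So, for example, with an input whose selected dimension has size 5, index = -1 resolves to 5 + (-1) + 1 = 5, i.e. the last slice; a negative dimension resolves to a dimension counted from the end in the same way.)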

@hughperkins
Owner

I think I fixed the Select issue. Next issue :-P

/data/git/torch-cl/install/bin/luajit: ...torch-cl/install/share/lua/5.1/rnn/MaskZeroCriterion.lua:66: invalid arguments: ClTensor ??? 
expected arguments: *ClTensor* string
stack traceback:
        [C]: in function 'apply'
        ...torch-cl/install/share/lua/5.1/rnn/MaskZeroCriterion.lua:66: in function 'forward'
        ...orch-cl/install/share/lua/5.1/rnn/SequencerCriterion.lua:50: in function 'forward'
        train.lua:98: in function 'opfunc'
        /data/git/torch-cl/install/share/lua/5.1/optim/adam.lua:33: in function 'adam'
        train.lua:131: in main chunk
        [C]: in function 'dofile'
        ...t/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406670

@hughperkins
Owner

Ah, another cuda-specific bit in rnn, this time in MaskZeroCriterion.lua:

   if torch.isTypeOf(zeroMask, 'torch.CudaTensor') then
      self.__zeroMask = self.__zeroMask or torch.FloatTensor()
      self.__zeroMask:resize(self._zeroMask:size()):copy(self._zeroMask)
      zeroMask = self._zeroMask

Taking a look...
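
The analogous OpenCL branch I have in mind would look roughly like this (a sketch only, untested, just mirroring the CUDA branch above):

   -- sketch, untested: mirror the CudaTensor branch for ClTensor, staging
   -- the mask through a host FloatTensor before it is used
   if torch.isTypeOf(zeroMask, 'torch.ClTensor') then
      self.__zeroMask = self.__zeroMask or torch.FloatTensor()
      self.__zeroMask:resize(zeroMask:size()):copy(zeroMask)
      zeroMask = self.__zeroMask
   end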

@hughperkins
Owner

Fixed the MaskZeroCriterion issue... next issue :-P

/data/git/torch-cl/install/bin/luajit: train.lua:101: attempt to call field 'exit' (a nil value)
stack traceback:
        train.lua:101: in function 'opfunc'
        /data/git/torch-cl/install/share/lua/5.1/optim/adam.lua:33: in function 'adam'
        train.lua:134: in main chunk
        [C]: in function 'dofile'
        ...t/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406670
exit status 1

Checking...

@hughperkins
Owner

hughperkins commented Aug 20, 2016

(ah, that's because I added this code into train.lua, for debugging :-P

    sys.exit(1)

:-P)

@hughperkins
Owner

Ok, it runs now, I think. You probably want to just reinstall distro-cl from scratch, on the whole... there were a whole ton of updates.

@hughperkins
Owner

hughperkins commented Aug 20, 2016

Screenshot:
[screenshot: neuralconvo2]

@Vincent717

hi Hugh,

Thank you for fixing the bug so quickly. However, I still encounter a somewhat similar problem.
As you advised, I have reinstalled distro-cl, and it didn't fail to start training with OpenCL this time. Unfortunately, it failed after finishing the first epoch.

Here is the stacktrace:

[============================================= 97/97 19s55ms | Step: 219ms

-- Eval on validation..

Finished in 29s593ms 3.2777983081751 examples/sec.

Epoch stats:
  Errors: min= 1.2164447917299
          max= 4.1041326915975
          median= 1.911158987776
          mean= 2.1000249479894
          std= 0.8865620521287
          ppl= 8.1663736446291
  val loss= 11.551878326818
  val ppl= 103972.14719455

(Saving model ...)
-- Shuffling
/home/hwg/torch-cl/install/bin/luajit: /home/hwg/torch-cl/install/share/lua/5.1/nn/Module.lua:263: invalid arguments: ClTensor ByteTensor
expected arguments: [*ClTensor*] ClTensor ClTensor
stack traceback:
        [C]: in function 'maskedSelect'
        /home/hwg/torch-cl/install/share/lua/5.1/nn/Module.lua:263: in function 'flatten'
        /home/hwg/torch-cl/install/share/lua/5.1/dpnn/Module.lua:205: in function 'getParameters'
        train.lua:106: in main chunk
        [C]: in function 'dofile'
        ...g/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406670

Seems like there is still something wrong with ClTensor in Module.lua? Do you have any suggestions about this?

Thank you again!

Best

Vincent

@hughperkins
Owner

Ok. Thoughts on how I can reproduce this, without waiting an hour or so between each test? Some way of reducing the number of iterations etc?

@hughperkins
Owner

(well, I will hack the loop I think; --dataset 30 doesn't seem to change anything)

@hughperkins
Owner

Ok, can reproduce the issue:

(Saving model ...)
/data/torch-cl/install/bin/luajit: /data/torch-cl/install/share/lua/5.1/nn/Module.lua:263: invalid arguments: ClTensor ByteTensor 
expected arguments: [*ClTensor*] ClTensor ClTensor
stack traceback:
        [C]: in function 'maskedSelect'
        /data/torch-cl/install/share/lua/5.1/nn/Module.lua:263: in function 'flatten'
        /data/torch-cl/install/share/lua/5.1/dpnn/Module.lua:205: in function 'getParameters'
        train.lua:70: in main chunk
        [C]: in function 'dofile'
        ...a/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406670
exit status 1

@hughperkins
Owner

Hmmm. It uses maskedSelect, and maskedSelect is not implemented in cltorch, because the CUDA version uses Thrust. Hmmm... pondering...

@hughperkins
Owner

Seems like one option to get this working might be to use boost.compute, which implements exclusive_scan.
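
For context: maskedSelect needs, for every selected element, the position it should land at in the output, and that position is exactly an exclusive prefix sum (scan) of the mask; exclusive_scan is the primitive boost.compute provides for that. A plain-Lua illustration of the idea (illustration only, not the cltorch kernel):

-- CPU illustration only: an exclusive prefix sum of the mask gives each
-- selected element its output position, which is the role exclusive_scan
-- would play in a GPU implementation
local function maskedSelectCpu(values, mask)
   local positions, count = {}, 0
   for i = 1, #mask do
      positions[i] = count      -- exclusive scan: sum of mask[1..i-1]
      count = count + mask[i]
   end
   local out = {}
   for i = 1, #mask do
      if mask[i] == 1 then
         out[positions[i] + 1] = values[i]
      end
   end
   return out
end

print(table.concat(maskedSelectCpu({10, 20, 30, 40}, {1, 0, 1, 1}), ' '))
-- prints: 10 30 40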

@hughperkins
Owner

@Vincent717 Question: if the solution involved needing to install the core Boost libraries, to what extent would that be acceptable to you?

@Vincent717

@hughperkins Thanks for your time. Installing Boost is totally OK for me. But it would be appreciated if you could explain the details a bit more concretely, because I am a newbie to OpenCL haha..

@lfuelling
Author

I'm having the same issue now 😁

@hughperkins
Owner

hughperkins commented Aug 22, 2016

Well... I implemented maskedSelect, https://github.com/hughperkins/cltorch/compare/add-maskedselect , but I need to figure out how to get :maskedSelect to accept a ByteTensor. Mostly, the hard bit is done for this particular bug, but it's not quite finished yet.

[edit: I should say, Jacob wrote an implementation for me actually :-) https://github.com/boostorg/compute/issues/646#issuecomment-241282490 ]

@hughperkins
Owner

So, fixed the ByteTensor bit. New error :-)

/data/torch-cl/install/bin/luajit: /data/torch-cl/install/share/lua/5.1/nn/Module.lua:265: bad argument #2 to 'maskedSelect' (mask nElements exceeds single-precision float consecutive integer precision size (2^24) at /data/torch-cl/opencl/cltorch/src/lib/THClTensorMasked.cpp:205)
stack traceback:
        [C]: in function 'maskedSelect'
        /data/torch-cl/install/share/lua/5.1/nn/Module.lua:265: in function 'flatten'
        /data/torch-cl/install/share/lua/5.1/dpnn/Module.lua:205: in function 'getParameters'
        train.lua:70: in main chunk
        [C]: in function 'dofile'
        ...a/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x

Checking...

@hughperkins
Owner

hughperkins commented Aug 22, 2016

"fixed" that. new error, seg fault:

(Saving model ...)
nn Module torch.type(flatParameters)    torch.ClTensor
nn Module torch.type(maskParameters)    torch.ByteTensor
totalElements 33110448
Segmentation fault
exit status 139

Checking...

@lfuelling
Author

lfuelling commented Aug 22, 2016

Welcome to debugging hell 😁


BTW: I can also use neural-style since your fixes. I used to get an "Out of Memory" error with torch-cl.

@hughperkins
Owner

Ok. Seems the segfault was related to my using bfboost. Using a full standard Titan X, there is no segfault. I've logged a bug report with bfboost for the bfboost-related segfault. So, now on to the next bug :-P

/data/torch-cl/install/bin/luajit: /data/torch-cl/install/share/lua/5.1/nn/Module.lua:264: bad argument #1 to 'copy' (sizes do not match at /data/torch-cl/opencl/cltorch/src/lib/THClTensorCopy.cpp:136)
stack traceback:
        [C]: in function 'copy'
        /data/torch-cl/install/share/lua/5.1/nn/Module.lua:264: in function 'flatten'
        /data/torch-cl/install/share/lua/5.1/dpnn/Module.lua:205: in function 'getParameters'
        train.lua:70: in main chunk
        [C]: in function 'dofile'
        ...a/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406670

@hughperkins
Owner

> BTW: I can also use neural-style since your fixes. I used to get an "Out of Memory" error with torch-cl.

Oh, cool :-) that's interesting.

@hughperkins
Owner

hughperkins commented Aug 24, 2016

(Seems there is some bug in :sum() for these tensor sizes:

th> a = torch.ClTensor(66220894)
                                                                      [0.0001s]
th> a:storageOffset()
1   
                                                                      [0.0000s]
th> a:storage():size()
66220894    
                                                                      [0.0001s]
th> c = a:fill(1)
                                                                      [0.0012s]
th> a:sum()
66220896    

Checking... )

(edit: it's pretty bizarre, I've checked all numbers up to 1200000 so far (it's checking as I write...), and they all work ok. The first number I know of that it fails for is 33110447, whose sum is 33110448.)
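
For anyone wondering where the 2^24 in the earlier maskedSelect error and these off-by-one sums come from: a single-precision float can only represent consecutive integers exactly up to 2^24 = 16777216; above that the representable values are 2 apart, and above 2^25 they are 4 apart. So if the reduction's accumulator or result is a float, 33110447 rounds to 33110448 and 66220894 rounds to 66220896, which matches the sums above exactly. A quick way to see the rounding, assuming plain torch on the CPU:

   require 'torch'
   -- integers above 2^24 round to the nearest representable float
   -- (ties round to even), hence the two sums observed above
   local t = torch.FloatTensor({33110447, 66220894})
   print(t[1], t[2])   -- expected: 33110448  66220896

That is just an observation about where the numbers come from, not a fix.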

@hughperkins
Owner

I think it's working now; see the screenshot below. Can you retry and let me know how that goes?

[screenshot: neuralconvoworking]

@Vincent717

OK, let me see. Do you mean we should follow your steps to build maskedSelect with boost.compute and so on, or just reinstall torch-cl?

@hughperkins
Owner

The easiest way (i.e. least likely to fail) is to reinstall torch-cl, the whole thing.

If you already reinstalled since #8 (comment), and you prefer hacking around over waiting for a full reinstall, you could try (unsupported... if it doesn't work, please do a full torch-cl reinstall :-) ):

cd ~/torch-cl
git pull
git submodule update --recursive
cd opencl/cltorch
luarocks make rocks/cltorch-scm-1.rockspec

@Vincent717

It works! This is so cool!! Thank you very much!
First time seeing someone debugging online, though mostly I can't follow..
but hope I can learn more from you. :-)

@hughperkins
Owner

> It works! This is so cool!! Thank you very much!

Cool :-)

> First time seeing someone debugging online, though mostly I can't follow.. but hope I can learn more from you. :-)

:-)

@lfuelling
Author

I can also confirm that it works, although I haven't gotten as far as saving a model yet. (Slow GPU)

@hughperkins
Owner

Cool :-)

@hughperkins
Owner

(Background info for anyone using bfboost: bfboost now supports the OpenCL methods needed to run this, and it runs OK on bfboost without segfaulting.)
