bad argument #3 to '?' (lookupTable.lua:updateOutput) #8

Closed · lfuelling opened this issue Aug 18, 2016 · 38 comments

@lfuelling

Hi,

I'm trying to get neuralconvo to work with distro-cl. Unfortunately, when starting the training I get an error that I'm unable to solve and that seems to come from LookupTable.lua.

I already filed an issue in the neuralconvo project, but it seems like it's not going to be fixed there.

First things first, here is the stacktrace:

$ th train.lua --opencl --gpu 1
-- Loading dataset
Loading vocabulary from data/vocab.t7 ...

Dataset stats:
  Vocabulary size: 35147
         Examples: 221282
libthclnn_searchpath    /Users/lfuelling/torch-cl/install/lib/lua/5.1/libTHCLNN.so
Using Apple , OpenCL platform: Apple
Using OpenCL device: AMD Radeon R9 M370X Compute Engine

-- Epoch 1 / 50  (LR= 0.001)

/Users/lfuelling/torch-cl/install/bin/luajit: bad argument #3 to '?' (number expected, got nil)
stack traceback:
    [C]: at 0x0f429f20
    [C]: in function '__newindex'
    ...ling/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:108: in function 'updateOutput'
    ...uelling/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    train.lua:96: in function 'opfunc'
    .../lfuelling/torch-cl/install/share/lua/5.1/optim/adam.lua:33: in function 'adam'
    train.lua:131: in main chunk
    [C]: in function 'dofile'
    ...g/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x010f0e1cf0

The error occurs here (in the last line):

function nn.LookupTable:updateOutput(input)
   if torch.type(input) ~= 'torch.ClTensor' then
      return self:baseUpdateOutput(input)
   end

   assert(not self.shouldScaleGradByFreq, 'self.shouldScaleGradByFreq not implemented')

   if self.size == nil then
      self.size = torch.LongStorage(2)
      self.size[1] = self.nIndex

Do you have any idea or ideally a solution for my problem?

Regards

Lukas

@hughperkins
Owner

Well, step 1: I've installed neuralconvo, and can reproduce the problem :-)

@hughperkins hughperkins self-assigned this Aug 18, 2016
@hughperkins
Owner

Ok, so the root cause is: clnn's LookupTable relies on self.nIndex and self.nOutput being populated by the __init method, but torch nn's LookupTable doesn't do that.

The fix is that I need to somehow update clnn's LookupTable to be able to survive without these values being initialized, probably by looking at the dimensions of self.weight. Working on it...
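
A minimal sketch of the kind of guard I have in mind (hypothetical helper, not the actual patch; it assumes self.weight is the usual nIndex x nOutput matrix that nn.LookupTable stores):

   -- hypothetical sketch: recover the missing sizes from the weight matrix
   -- when __init never populated them (weight is nIndex x nOutput)
   local function ensureSizes(lookup)
      lookup.nIndex = lookup.nIndex or lookup.weight:size(1)
      lookup.nOutput = lookup.nOutput or lookup.weight:size(2)
   end

so clnn's updateOutput could call something like ensureSizes(self) before building self.size.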

hughperkins added a commit to hughperkins/clnn that referenced this issue Aug 18, 2016
hughperkins added a commit that referenced this issue Aug 18, 2016
@hughperkins
Owner

So, I think this problem is fixed. Which uncovers a new one :-P

rnn/MaskZero.lua:72: invalid arguments: ByteTensor ClTensor number 
expected arguments: [*ClTensor*] ClTensor float | [*ClTensor*] ClTensor ClTensor
stack traceback:
        [C]: in function 'eq'
        /data/git/torch-cl/install/share/lua/5.1/rnn/MaskZero.lua:72: in function 'updateOutput'
        /data/git/torch-cl/install/share/lua/5.1/rnn/LSTM.lua:162: in function 'updateOutput'
        /data/git/torch-cl/install/share/lua/5.1/rnn/Sequencer.lua:59: in function 'updateOutput'
        /data/git/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'

@hughperkins
Owner

Seems like MaskZero.lua is somewhat coupled with cunn:

   self.zeroMask = self.zeroMask or ((torch.type(rmi) == 'torch.CudaTensor') and torch.CudaByteTensor() or torch.ByteTensor())

I may need to either fork rnn, or submit a patch, or probably both.
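
Roughly the kind of generalization I'm thinking of (a sketch only, untested; the buffer name _zeroMaskCl is made up here, and whether cltorch can copy a ClTensor straight into a ByteTensor is an assumption):

   -- sketch, untested: cltorch's eq only writes into a ClTensor
   -- ([*ClTensor*] ClTensor float, per the error above), so for ClTensor
   -- inputs build the mask on the device, then move it to a host ByteTensor
   if torch.type(rmi) == 'torch.ClTensor' then
      self._zeroMaskCl = self._zeroMaskCl or torch.ClTensor()
      self._zeroMaskCl:eq(rmi, 0)
      self.zeroMask = self.zeroMask or torch.ByteTensor()
      self.zeroMask:resize(self._zeroMaskCl:size()):copy(self._zeroMaskCl)
   else
      self.zeroMask = self.zeroMask or
         ((torch.type(rmi) == 'torch.CudaTensor') and torch.CudaByteTensor() or torch.ByteTensor())
   end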

@lfuelling
Author

Thank you so much. Let me know if I can help somehow.

@hughperkins
Owner

Ok, so I've plausibly fixed MaskZero.lua, in draft. But now I see a new error (see below), so I'd better fix that one too :-P

/data/git/torch-cl/install/bin/luajit: /data/git/torch-cl/install/share/lua/5.1/nn/Select.lua:10: bad argument #3 to 'select' (out of range at /data/git/torch-cl/opencl/cltorch/src/lib/THClTensor.cpp:415)
stack traceback:
        [C]: in function 'select'
        /data/git/torch-cl/install/share/lua/5.1/nn/Select.lua:10: in function 'updateOutput'
        /data/git/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
        train.lua:95: in function 'opfunc'
        /data/git/torch-cl/install/share/lua/5.1/optim/adam.lua:33: in function 'adam'
        train.lua:131: in main chunk
        [C]: in function 'dofile'
        ...t/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406670
exit status 1

@hughperkins
Owner

For the Select issue, it seems like passing in -1 as a select index is now legal. I'll check this...

@hughperkins
Owner

Oh, Select.lua has been updated to handle negative indices:

   local dim = self.dimension < 0 and input:dim() + self.dimension + 1 or self.dimension
   local index = self.index < 0 and input:size(dim) + self.index + 1 or self.index
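
(So, for example, with an input whose selected dimension has size 5, index = -1 resolves to 5 + (-1) + 1 = 5, i.e. the last slice; a negative dimension resolves to a dimension counted from the end in the same way.)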

@hughperkins
Owner

I think I fixed the Select issue. Next issue :-P

/data/git/torch-cl/install/bin/luajit: ...torch-cl/install/share/lua/5.1/rnn/MaskZeroCriterion.lua:66: invalid arguments: ClTensor ??? 
expected arguments: *ClTensor* string
stack traceback:
        [C]: in function 'apply'
        ...torch-cl/install/share/lua/5.1/rnn/MaskZeroCriterion.lua:66: in function 'forward'
        ...orch-cl/install/share/lua/5.1/rnn/SequencerCriterion.lua:50: in function 'forward'
        train.lua:98: in function 'opfunc'
        /data/git/torch-cl/install/share/lua/5.1/optim/adam.lua:33: in function 'adam'
        train.lua:131: in main chunk
        [C]: in function 'dofile'
        ...t/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406670

@hughperkins
Owner

Ah, another cuda-specific bit in rnn, this time in MaskZeroCriterion.lua:

   if torch.isTypeOf(zeroMask, 'torch.CudaTensor') then
      self.__zeroMask = self.__zeroMask or torch.FloatTensor()
      self.__zeroMask:resize(self._zeroMask:size()):copy(self._zeroMask)
      zeroMask = self._zeroMask

Taking a look...
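
The analogous OpenCL branch I have in mind would look roughly like this (a sketch only, untested, just mirroring the CUDA branch above):

   -- sketch, untested: mirror the CudaTensor branch for ClTensor, staging
   -- the mask through a host FloatTensor before it is used
   if torch.isTypeOf(zeroMask, 'torch.ClTensor') then
      self.__zeroMask = self.__zeroMask or torch.FloatTensor()
      self.__zeroMask:resize(zeroMask:size()):copy(zeroMask)
      zeroMask = self.__zeroMask
   end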

@hughperkins
Owner

Fixed the MaskZeroCriterion issue... next issue :-P

/data/git/torch-cl/install/bin/luajit: train.lua:101: attempt to call field 'exit' (a nil value)
stack traceback:
        train.lua:101: in function 'opfunc'
        /data/git/torch-cl/install/share/lua/5.1/optim/adam.lua:33: in function 'adam'
        train.lua:134: in main chunk
        [C]: in function 'dofile'
        ...t/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406670
exit status 1

Checking...

@hughperkins
Owner

hughperkins commented Aug 20, 2016

(ah, that's because I added this code into train.lua, for debugging :-P

    sys.exit(1)

:-P)

@hughperkins
Owner

Ok, it runs now, I think. You probably want to just reinstall distro-cl from scratch, on the whole... there were a whole ton of updates.

@hughperkins
Owner

hughperkins commented Aug 20, 2016

Screenshot:
[screenshot: neuralconvo2]

@Vincent717

hi Hugh,

Thank you for fixing the bug so quickly. However, I still encounter a somewhat similar problem.
As you advised, I have reinstalled distro-cl, and it didn't fail to start training with OpenCL this time. Unfortunately, it failed after finishing the first epoch.

Here is the stacktrace:

[============================================= 97/97 19s55ms | Step: 219ms

-- Eval on validation..

Finished in 29s593ms 3.2777983081751 examples/sec.

Epoch stats:
  Errors: min= 1.2164447917299
          max= 4.1041326915975
          median= 1.911158987776
          mean= 2.1000249479894
          std= 0.8865620521287
          ppl= 8.1663736446291
  val loss= 11.551878326818
  val ppl= 103972.14719455

(Saving model ...)
-- Shuffling
/home/hwg/torch-cl/install/bin/luajit: /home/hwg/torch-cl/install/share/lua/5.1/nn/Module.lua:263: invalid arguments: ClTensor ByteTensor
expected arguments: [*ClTensor*] ClTensor ClTensor
stack traceback:
        [C]: in function 'maskedSelect'
        /home/hwg/torch-cl/install/share/lua/5.1/nn/Module.lua:263: in function 'flatten'
        /home/hwg/torch-cl/install/share/lua/5.1/dpnn/Module.lua:205: in function 'getParameters'
        train.lua:106: in main chunk
        [C]: in function 'dofile'
        ...g/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406670

Seems like there is still something wrong with ClTensor in Module.lua? Do you have any suggestions about this?

Thank you again!

Best

Vincent

@hughperkins
Owner

Ok. Thoughts on how I can reproduce this, without waiting an hour or so between each test? Some way of reducing the number of iterations etc?

@hughperkins
Owner

(well, I will hack the loop I think; --dataset 30 doesn't seem to change anything)

@hughperkins
Owner

Ok, can reproduce the issue:

(Saving model ...)
/data/torch-cl/install/bin/luajit: /data/torch-cl/install/share/lua/5.1/nn/Module.lua:263: invalid arguments: ClTensor ByteTensor 
expected arguments: [*ClTensor*] ClTensor ClTensor
stack traceback:
        [C]: in function 'maskedSelect'
        /data/torch-cl/install/share/lua/5.1/nn/Module.lua:263: in function 'flatten'
        /data/torch-cl/install/share/lua/5.1/dpnn/Module.lua:205: in function 'getParameters'
        train.lua:70: in main chunk
        [C]: in function 'dofile'
        ...a/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406670
exit status 1

@hughperkins
Owner

Hmmm. It uses maskedSelect, and maskedSelect is not implemented in cltorch, because the CUDA version uses Thrust. Hmmm... pondering...

@hughperkins
Owner

Seems like one option to get this working might be to use boost.compute, which implements exclusive_scan.
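
For context: maskedSelect needs, for every selected element, the position it should land at in the output, and that position is exactly an exclusive prefix sum (scan) of the mask; exclusive_scan is the primitive boost.compute provides for that. A plain-Lua illustration of the idea (illustration only, not the cltorch kernel):

-- CPU illustration only: an exclusive prefix sum of the mask gives each
-- selected element its output position, which is the role exclusive_scan
-- would play in a GPU implementation
local function maskedSelectCpu(values, mask)
   local positions, count = {}, 0
   for i = 1, #mask do
      positions[i] = count      -- exclusive scan: sum of mask[1..i-1]
      count = count + mask[i]
   end
   local out = {}
   for i = 1, #mask do
      if mask[i] == 1 then
         out[positions[i] + 1] = values[i]
      end
   end
   return out
end

print(table.concat(maskedSelectCpu({10, 20, 30, 40}, {1, 0, 1, 1}), ' '))
-- prints: 10 30 40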

@hughperkins
Owner

@Vincent717 Question: if the solution involved needing to install the core Boost libraries, to what extent would that be acceptable to you?

@Vincent717

@hughperkins Thanks for your time. Installing Boost is totally OK for me. But it would be appreciated if you could explain the details a bit more concretely, because I am a newbie to OpenCL haha..

@lfuelling
Author

I'm having the same issue now 😁

@hughperkins
Owner

hughperkins commented Aug 22, 2016

Well... I implemented maskedSelect, https://github.com/hughperkins/cltorch/compare/add-maskedselect , but I need to figure out how to get :maskedSelect to accept a ByteTensor. Mostly, the hard bit is done for this particular bug, but it's not quite finished yet.

[edit: I should say, Jacob wrote an implementation for me actually :-) https://github.com/boostorg/compute/issues/646#issuecomment-241282490 ]

@hughperkins
Owner

So, fixed the ByteTensor bit. New error :-)

/data/torch-cl/install/bin/luajit: /data/torch-cl/install/share/lua/5.1/nn/Module.lua:265: bad argument #2 to 'maskedSelect' (mask nElements exceeds single-precision float consecutive integer precision size (2^24) at /data/torch-cl/opencl/cltorch/src/lib/THClTensorMasked.cpp:205)
stack traceback:
        [C]: in function 'maskedSelect'
        /data/torch-cl/install/share/lua/5.1/nn/Module.lua:265: in function 'flatten'
        /data/torch-cl/install/share/lua/5.1/dpnn/Module.lua:205: in function 'getParameters'
        train.lua:70: in main chunk
        [C]: in function 'dofile'
        ...a/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x

Checking...

@hughperkins
Owner

hughperkins commented Aug 22, 2016

"fixed" that. new error, seg fault:

(Saving model ...)
nn Module torch.type(flatParameters)    torch.ClTensor
nn Module torch.type(maskParameters)    torch.ByteTensor
totalElements 33110448
Segmentation fault
exit status 139

Checking...

@lfuelling
Author

lfuelling commented Aug 22, 2016

Welcome to debugging hell 😁


BTW: I can also use neural-style since your fixes. I used to get an "Out of Memory" error with torch-cl.

@hughperkins
Owner

Ok. Seems the segfault was related to my using bfboost. Using a full standard Titan X, there is no segfault. I've logged a bug report with bfboost for the bfboost-related segfault. So, now on to the next bug :-P

/data/torch-cl/install/bin/luajit: /data/torch-cl/install/share/lua/5.1/nn/Module.lua:264: bad argument #1 to 'copy' (sizes do not match at /data/torch-cl/opencl/cltorch/src/lib/THClTensorCopy.cpp:136)
stack traceback:
        [C]: in function 'copy'
        /data/torch-cl/install/share/lua/5.1/nn/Module.lua:264: in function 'flatten'
        /data/torch-cl/install/share/lua/5.1/dpnn/Module.lua:205: in function 'getParameters'
        train.lua:70: in main chunk
        [C]: in function 'dofile'
        ...a/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406670

@hughperkins
Owner

> BTW: I can also use neural-style since your fixes. I used to get an "Out of Memory" error with torch-cl.

Oh, cool :-) that's interesting.

@hughperkins
Owner

hughperkins commented Aug 24, 2016

(Seems there is some bug in :sum() for these tensor sizes:

th> a = torch.ClTensor(66220894)
                                                                      [0.0001s]
th> a:storageOffset()
1   
                                                                      [0.0000s]
th> a:storage():size()
66220894    
                                                                      [0.0001s]
th> c = a:fill(1)
                                                                      [0.0012s]
th> a:sum()
66220896    

Checking... )

(edit: it's pretty bizarre, I've checked all numbers up to 1200000 so far (it's checking as I write...), and they all work ok. The first number I know of that it fails for is 33110447, whose sum is 33110448.)
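
For anyone wondering where the 2^24 in the earlier maskedSelect error and these off-by-one sums come from: a single-precision float can only represent consecutive integers exactly up to 2^24 = 16777216; above that the representable values are 2 apart, and above 2^25 they are 4 apart. So if the reduction's accumulator or result is a float, 33110447 rounds to 33110448 and 66220894 rounds to 66220896, which matches the sums above exactly. A quick way to see the rounding, assuming plain torch on the CPU:

   require 'torch'
   -- integers above 2^24 round to the nearest representable float
   -- (ties round to even), hence the two sums observed above
   local t = torch.FloatTensor({33110447, 66220894})
   print(t[1], t[2])   -- expected: 33110448  66220896

That is just an observation about where the numbers come from, not a fix.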

@hughperkins
Owner

I think it's working now; see the screenshot below. Can you retry and let me know how that goes?

[screenshot: neuralconvoworking]

@Vincent717

OK, let me see. Do you mean we should follow your steps to build maskedSelect with boost.compute and so on, or just reinstall torch-cl?

@hughperkins
Owner

The easiest way (i.e. least likely to fail) is to reinstall torch-cl, the whole thing.

If you already reinstalled since #8 (comment), and you prefer hacking around over waiting for a full reinstall, you could try (unsupported... if it doesn't work, please do a full torch-cl reinstall :-) ):

cd ~/torch-cl
git pull
git submodule update --recursive
cd opencl/cltorch
luarocks make rocks/cltorch-scm-1.rockspec

@Vincent717

It works! This is so cool!! Thank you very much!
First time seeing someone debugging online, though mostly I can't follow..
but hope I can learn more from you. :-)

@hughperkins
Owner

> It works! This is so cool!! Thank you very much!

Cool :-)

> First time seeing someone debugging online, though mostly I can't follow.. but hope I can learn more from you. :-)

:-)

@lfuelling
Author

I can also confirm that it works, although I haven't gotten as far as saving a model yet. (Slow GPU)

@hughperkins
Owner

Cool :-)

@hughperkins
Owner

(Background info for anyone using bfboost: bfboost now supports the OpenCL methods needed to run this, and it runs OK on bfboost without segfaulting.)
