opencl training fail #30

SolarPeng · 2016-06-01T08:17:22Z

I have never been able to train successfully.

th train.lua --opencl --dataset 50000 --hiddenSize 1000

-- Loading dataset
Loading vocabulary from data/vocab.t7 ...

Dataset stats:
Vocabulary size: 25931
Examples: 83632
libthclnn_searchpath /Users/SolarKing/Dev/torch-cl/install/lib/lua/5.1/libTHCLNN.so
Using Apple , OpenCL platform: Apple
Using OpenCL device: GeForce 9400M

-- Epoch 1 / 50

/Users/SolarKing/Dev/torch/install/bin/luajit: ...larKing/Dev/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
bad argument #3 to '?' (number expected, got nil)
stack traceback:
[C]: at 0x0ebe4500
[C]: in function '__newindex'
.../Dev/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:108: in function <.../Dev/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:99>
[C]: in function 'xpcall'
...larKing/Dev/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
...arKing/Dev/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
./seq2seq.lua:71: in function 'train'
train.lua:85: in main chunk
[C]: in function 'dofile'
.../Dev/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x010e8bbbb0

WARNING: If you see a stack trace below, it doesn't point to the place where this error occured. Please use only the one above.
stack traceback:
[C]: in function 'error'
...larKing/Dev/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
...arKing/Dev/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
./seq2seq.lua:71: in function 'train'
train.lua:85: in main chunk
[C]: in function 'dofile'
.../Dev/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x010e8bbbb0

lfuelling · 2016-06-20T11:52:17Z

I'm also using torch-cl. According to the tutorial there, you shouldn't install nn, cudnn, cldnn, etc., because that breaks the installation. The only things I installed were rnn and penlight.

Got something similar:

lerk@blrg:~/workspace/neuralconvo$ th train.lua --opencl
-- Loading dataset
Loading vocabulary from data/vocab.t7 ...

Dataset stats:
  Vocabulary size: 35147
         Examples: 221282
libthclnn_searchpath    /home/lerk/torch-cl/install/lib/lua/5.1/libTHCLNN.so
Using NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce GTX 660

-- Epoch 1 / 50

/home/lerk/torch-cl/install/bin/luajit: bad argument #3 to '?' (number expected, got nil)
stack traceback:
    [C]: at 0x7f142d00baa0
    [C]: in function '__newindex'
    ...lerk/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:108: in function 'updateOutput'
    /home/lerk/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./seq2seq.lua:66: in function 'train'
    train.lua:88: in main chunk
    [C]: in function 'dofile'
    ...k/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00405e90

UPDATE: I also tried this on my MacBook Pro, same error:

lerk@blackreach ~/workspace/neuralconvo [14:33:45]
> $ th train.lua --opencl [±master ✓]
-- Loading dataset
data/vocab.t7 not found
-- Parsing Cornell movie dialogs data set ...
 [=============================================================== 387810/387810 =======>] Tot: 1s615ms | Step: 0ms
-- Pre-processing data
 [================================================================ 166194/166194 ======>] Tot: 31s885ms | Step: 0ms
-- Removing low frequency words
 [================================================================ 221282/221282 ======>] Tot: 14s809ms | Step: 0ms
Writing data/examples.t7 ...
 [=============================================================== 221282/221282 =======>] Tot: 33s43ms | Step: 0ms
Writing data/vocab.t7 ...

Dataset stats:
  Vocabulary size: 35147
         Examples: 221282
libthclnn_searchpath    /Users/lerk/torch-cl/install/lib/lua/5.1/libTHCLNN.so
Using Apple , OpenCL platform: Apple
Using OpenCL device: ATI Radeon HD 6770M

-- Epoch 1 / 50

/Users/lerk/torch-cl/install/bin/luajit: bad argument #3 to '?' (number expected, got nil)
stack traceback:
    [C]: at 0x05350f40
    [C]: in function '__newindex'
    ...lerk/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:108: in function 'updateOutput'
    ...rs/lerk/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./seq2seq.lua:66: in function 'train'
    train.lua:88: in main chunk
    [C]: in function 'dofile'
    ...k/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x0105008d00

lfuelling · 2016-06-20T15:45:25Z

The stack trace suggests that the error is on this line:

local encoderOutput = self.encoder:forward(encoderInputs)

~~I tried to locate the error and I think it's on line 88 in train.lua:~~

model:train(encInputs, decInputs, decTargets)

~~Could it be that #29 introduced this? It's the latest change on this line. Previously it was:~~

local err = model:train(input, target)

~~I'll try to fix this somehow (I don't even know Lua) and then get back here.~~

UPDATE: I checked out the last commit before the merge and I got the same error again. Only the hex numbers differ:

lerk@blrg:~/workspace/neuralconvo$ th train.lua --opencl
-- Loading dataset
data/vocab.t7 not found
-- Parsing Cornell movie dialogs data set ...
 [=============================================================== 387810/387810 =======>] Tot: 1s391ms | Step: 0ms
-- Pre-processing data
 [============================================================= 166194/166194 =========>] Tot: 5m14s | Step: 0ms
-- Removing low frequency words
 [============================================================ 221282/221282 ==========>] Tot: 7m6s | Step: 1ms
Writing data/examples.t7 ...
 [============================================================ 221282/221282 ==========>] Tot: 7m4s | Step: 5ms
Writing data/vocab.t7 ...

Dataset stats:
  Vocabulary size: 35147
         Examples: 221282
libthclnn_searchpath    /home/lerk/torch-cl/install/lib/lua/5.1/libTHCLNN.so
Using NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce GTX 660

-- Epoch 1 / 50

/home/lerk/torch-cl/install/bin/luajit: bad argument #3 to '?' (number expected, got nil)
stack traceback:
    [C]: at 0x7fcd865b3aa0
    [C]: in function '__newindex'
    ...lerk/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:108: in function 'updateOutput'
    /home/lerk/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./seq2seq.lua:66: in function 'train'
    train.lua:88: in main chunk
    [C]: in function 'dofile'
    ...k/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00405e90

mgomes · 2016-07-07T14:26:43Z

I am hitting this as well. I think something changed in the last month or so where the cltorch and clnn modules are no longer supported via luarocks. Instead you have to use the torch-cl distro.

The problem is coming from train.lua:70 in model:getParameters(). That no longer returns the parameters. I'm still looking into it.

lfuelling · 2016-08-18T09:29:58Z

@mgomes, have you found anything so far?

I tried this again and stumbled upon the following part in the official torch source:

function optim.adam(opfunc, x, config, state)
   -- (0) get/update state
   local config = config or {}
   local state = state or config
   local lr = config.learningRate or 0.001
   local lrd = config.learningRateDecay or 0

   local beta1 = config.beta1 or 0.9
   local beta2 = config.beta2 or 0.999
   local epsilon = config.epsilon or 1e-8

The stack trace reports a nil value where a number was expected, and I suspect this refers to the epsilon assignment. I assume that cltorch (distro-cl) has no default value for it, but I am unable to find the corresponding file in cltorch.

The config object that gets passed to the function above is the following:

{
  momentum : 0.9
  learningRate : 0.001
}

Here's another stacktrace:

> $ th train.lua --opencl [±master ●]
-- Loading dataset
Loading vocabulary from data/vocab.t7 ...

Dataset stats:
  Vocabulary size: 35147
         Examples: 221282
libthclnn_searchpath    /Users/lfuelling/torch-cl/install/lib/lua/5.1/libTHCLNN.so
Using Apple , OpenCL platform: Apple
Using OpenCL device: Iris Pro

-- Epoch 1 / 50  (LR= 0.001)

{
  momentum : 0.9
  learningRate : 0.001
}
/Users/lfuelling/torch-cl/install/bin/luajit: bad argument #3 to '?' (number expected, got nil)
stack traceback:
    [C]: at 0x01deef20
    [C]: in function '__newindex'
    ...ling/torch-cl/install/share/lua/5.1/clnn/LookupTable.lua:108: in function 'updateOutput'
    ...uelling/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    train.lua:93: in function 'opfunc'
    .../lfuelling/torch-cl/install/share/lua/5.1/optim/adam.lua:33: in function 'adam'
    train.lua:129: in main chunk
    [C]: in function 'dofile'
    ...g/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x0101aa6cf0

UPDATE: I'm stupid. If you read the output above, you'll notice "Using OpenCL device: Iris Pro". I bet it works when I use the external GPU. neural-style has an option to set the GPU you want. I'll try to implement this.

hughperkins · 2016-08-21T12:43:15Z

OK, I fixed a bunch of bugs yesterday. I think the easiest thing to do is simply to reinstall distro-cl, since there were a number of fixes. Specifically, rnn is now pinned via rocks-cl, which implies a change to your torch-cl/install/etc/luarocks/config.lua file: it needs one additional rocks_servers entry, as follows:

rocks_servers = {
   [[https://raw.githubusercontent.com/hughperkins/rocks-cl/master]],
   [[https://raw.githubusercontent.com/torch/rocks/master]],
   [[https://raw.githubusercontent.com/rocks-moonscript-org/moonrocks-mirror/master]]
}
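A quick way to check whether an existing install already has the extra server is sketched below. The helper name `has_rocks_cl` is hypothetical, and the config path assumes torch-cl was installed under `~/torch-cl`; adjust it to your own install location.

```shell
# Hypothetical helper: succeeds if the given luarocks config mentions the rocks-cl server.
has_rocks_cl() {
  grep -q 'hughperkins/rocks-cl' "$1" 2>/dev/null
}

# Assumed install location; change this if torch-cl lives elsewhere.
config="$HOME/torch-cl/install/etc/luarocks/config.lua"

if has_rocks_cl "$config"; then
  echo "rocks-cl server is configured"
else
  echo "rocks-cl server not found in $config"
fi
```

If the server is not found, reinstalling distro-cl as described in this comment is probably simpler than editing the file by hand.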

There was also a change to the exe/luajit-rocks submodule, to point to https://github.com/hughperkins/luajit-rocks , to hold this configuration.

I just now tested a full fresh reinstallation, using the following commands:

git clone --recursive https://github.com/hughperkins/distro-cl torch-cl
cd torch-cl
bash install.sh -b
source /data/torch-cl/install/bin/torch-activate  # normally this would be ~/torch-cl/... for you
luarocks install rnn
luarocks install torchx
cd /data/git/neuralconvo
bfboost client -r th train.lua --opencl   # you won't need or want the `bfboost client -r` part; it's only there because I'm running on bfboost
# et voilà, it's running; see the screenshot

Screenshot:

lfuelling · 2016-08-22T17:27:54Z

For those too lazy to read the file: with -b, install.sh doesn't prompt for anything. Check your .whateverrc after the install and remove any duplicate torch-activate entries.
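A minimal sketch for spotting those duplicates follows. The helper name `count_torch_activate` is hypothetical, and `~/.bashrc` is only an assumption; substitute whichever rc file your shell actually reads.

```shell
# Hypothetical helper: print how many lines of a file mention torch-activate.
count_torch_activate() {
  grep -c 'torch-activate' "$1"
}

# Assumed rc file; use ~/.zshrc, ~/.profile, etc. as appropriate.
rc="$HOME/.bashrc"

if [ -f "$rc" ] && [ "$(count_torch_activate "$rc")" -gt 1 ]; then
  # More than one entry: list them with line numbers so the extras can be removed by hand.
  grep -n 'torch-activate' "$rc"
fi
```

The sketch only reports the lines and leaves the rc file untouched, so the cleanup itself stays a manual step.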

hughperkins · 2016-08-23T12:13:36Z

It's not working yet... I'm still trying to fix it. I got as far as implementing maskedSelect, but it currently causes a segfault in this scenario, which I need to look into. I think you might as well leave this open for now.

lfuelling · 2016-08-23T13:12:13Z

I think it was automatically closed. Ping @macournoyer

macournoyer · 2016-08-23T13:25:54Z

Ooops! Autoclosed indeed.

hughperkins · 2016-08-25T13:32:04Z

It might be working now. Can you pull down the latest updates to distro-cl and retry?

This was referenced Aug 18, 2016

add gpu selection for opencl/cuda use #49

Merged

bad argument #3 to '?' (lookupTable.lua:updateOutput) hughperkins/distro-cl#8

Closed

macournoyer closed this as completed in #49 Aug 23, 2016

macournoyer reopened this Aug 23, 2016
