When I train XTransformer with PECOS, a training error occurs in the matcher stage.
The dataset has 108457 instances and the hierarchical label tree is [32, 1102]. In the matcher stage, while training the second layer of the label tree (the first layer trains without any problem), the run got stuck predicting on the training data after matcher fine-tuning finished; see pecos.xmc.xtransformer.matcher.
I thought this was caused by my training set being too large, so I modified the code in pecos.xmc.xtransformer.matcher to predict the training data in chunks (30000 instances at a time).
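Roughly, the change I made follows the shape below. This is only a simplified sketch of the idea: `predict_fn` stands in for the matcher's actual prediction call, and the slicing/stacking details here are mine, not the real PECOS code.

```python
import scipy.sparse as smat

def predict_in_chunks(predict_fn, X_text, csr_codes_next, chunk_size=30000):
    """Sketch only: run the matcher's prediction on fixed-size slices of the
    training data and stack the sparse results, instead of predicting on all
    ~108k instances in one shot."""
    parts = []
    n = csr_codes_next.shape[0]
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        # each chunk carries only its own slice of text and of the
        # previous-layer code matrix, so row indices stay below chunk_size
        parts.append(predict_fn(X_text[start:end], csr_codes_next[start:end]))
    return smat.vstack(parts, format="csr")
```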
But then another problem occurred; see the training log below.
05/08/2023 10:31:56 - INFO - pecos.xmc.xtransformer.matcher - Reload the best checkpoint from /tmp/tmp0kdzh7n5
05/08/2023 10:31:58 - INFO - pecos.xmc.xtransformer.matcher - Predict with csr_codes_next((30000, 1102)) with avr_nnz=172.31423333333333
05/08/2023 10:31:58 - INFO - pecos.xmc.xtransformer.module - Constructed XMCTextTensorizer, tokenized=True, len=30000
05/08/2023 10:32:29 - INFO - pecos.xmc.xtransformer.matcher - Predict with csr_codes_next((30000, 1102)) with avr_nnz=172.2335
05/08/2023 10:32:29 - INFO - pecos.xmc.xtransformer.module - Constructed XMCTextTensorizer, tokenized=True, len=30000
Traceback (most recent call last):
File "/opt/conda/envs/nlp/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/nlp/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/train.py", line 564, in
do_train(args)
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/train.py", line 548, in do_train
xtf = XTransformer.train(
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/model.py", line 447, in train
res_dict = TransformerMatcher.train(
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/matcher.py", line 1402, in train
P_trn, inst_embeddings = matcher.predict(
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/matcher.py", line 662, in predict
cur_P, cur_embedding = self._predict(
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/matcher.py", line 812, in _predict
cur_act_labels = csr_codes_next[inputs["instance_number"].cpu()]
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/scipy/sparse/_index.py", line 47, in getitem
row, col = self._validate_indices(key)
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/scipy/sparse/_index.py", line 159, in _validate_indices
row = self._asindices(row, M)
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/scipy/sparse/_index.py", line 191, in _asindices
raise IndexError('index (%d) out of range' % max_indx)
IndexError: index (30255) out of range
I'm not sure if this is a bug, can you give me some advice? Thanks!
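For reference, the IndexError itself is just SciPy rejecting a row index that is larger than the number of rows in one 30000-row chunk. A minimal standalone reproduction (the shapes mirror my log; nothing here is PECOS code):

```python
import numpy as np
import scipy.sparse as smat

# one 30000 x 1102 chunk, like the csr_codes_next chunk in the log
chunk = smat.random(30000, 1102, density=0.01, format="csr")

# a row index computed for the full dataset (e.g. 30255) is out of range
# for a single 30000-row chunk, so SciPy raises the same IndexError
rows = np.array([30255])
try:
    chunk[rows]
except IndexError as e:
    print(e)  # index (30255) out of range
```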
Environment
Operating system: Ubuntu 20.04.4 LTS container
Python version: Python 3.8.16
PECOS version: libpecos 1.0.0
Hi xiaokening, the issue is caused by the pre-tensorized prob.X_text containing instance indices larger than the partitioned chunk size (30000). This should not happen if prob.X_text is not tensorized (i.e., a list of str).
If you want to keep the manually truncated predict, one simple workaround is to turn off train_params.pre_tokenize so that every chunk of data is tensorized independently.
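For example, something along these lines in your training parameters should do it. The exact nesting below is only illustrative (it assumes pre_tokenize sits on the matcher training params in your params dict); adjust it to however you currently build your parameters:

```python
# Sketch only: disable shared pre-tensorization so each prediction chunk
# is tokenized on its own. The nesting of "pre_tokenize" is assumed here.
train_params = {
    "matcher_params_chain": {
        "pre_tokenize": False,
    },
}
```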
@jiong-zhang When I train XTransformer with PECOS, the same training error occurs in the matcher stage. At first I thought my data volume was too large, but the problem still appears after I increased the memory. It can happen in any matcher stage (and I am not manually truncating predict).
I used the top and free commands to monitor the running program and noticed that the number of processes suddenly increased and then disappeared, so I suspect a problem with the dataloader; you can refer to this link. The standalone snippet below shows the same spawn-then-exit pattern.
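For reference, this is what a multi-worker PyTorch DataLoader normally does: worker processes appear when iteration starts and exit when the loader is torn down, which matches what I see in top. This is plain PyTorch, nothing PECOS-specific:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# With num_workers > 0, iterating spawns that many worker processes;
# they exit again once the epoch ends (persistent_workers defaults to False).
ds = TensorDataset(torch.arange(1000, dtype=torch.float32).unsqueeze(1))
loader = DataLoader(ds, batch_size=32, num_workers=4)

for (batch,) in loader:  # worker processes visible in `top` while this runs
    pass                 # ...and they disappear after the loop finishes
```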
Note: after matcher fine-tuning completed, it got stuck at the first step of predicting on the training data; see pecos.xmc.xtransformer.matcher.
Can you give me some advice? Thanks!
Environment
Operating system: Ubuntu 20.04.4 LTS container
Python version: Python 3.8.16
PECOS version: libpecos 1.0.0
PyTorch version: pytorch==1.11.0
CUDA/GPU: 4 x NVIDIA V100 16GB; cudatoolkit=11.3