Integration test requiring training via Ludwig failing on personal machine #1028

Open
1 of 2 tasks
hershd23 opened this issue Sep 1, 2023 · 8 comments
Labels
Bug 🐞 EVA is not working as expected Crash 💥 EVA is crashing

Comments

@hershd23
Contributor

hershd23 commented Sep 1, 2023

Search before asking

  • I have searched the EvaDB issues and found no similar bug report.

Bug

$ ~ PYTHONPATH="." python -m pytest test/integration_tests/long/test_model_train.py -k 'test_ludwig_automl'
ERROR    evadb.utils.logging_manager:plan_executor.py:182 Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument weight in method wrapper_CUDA__native_batch_norm)
Traceback (most recent call last):
  File "/home/hershd23/Desktop/evadb/evadb/executor/plan_executor.py", line 178, in execute_plan
    yield from output
  File "/home/hershd23/Desktop/evadb/evadb/executor/project_executor.py", line 34, in exec
    batch = apply_project(batch, self.target_list, self.catalog())
  File "/home/hershd23/Desktop/evadb/evadb/executor/executor_utils.py", line 42, in apply_project
    batches = [expr.evaluate(batch) for expr in project_list]
  File "/home/hershd23/Desktop/evadb/evadb/executor/executor_utils.py", line 42, in <listcomp>
    batches = [expr.evaluate(batch) for expr in project_list]
  File "/home/hershd23/Desktop/evadb/evadb/expression/function_expression.py", line 129, in evaluate
    outcomes = self._apply_function_expression(func, batch, **kwargs)
  File "/home/hershd23/Desktop/evadb/evadb/expression/function_expression.py", line 188, in _apply_function_expression
    return func_args.apply_function_expression(func)
  File "/home/hershd23/Desktop/evadb/evadb/models/storage/batch.py", line 173, in apply_function_expression
    return Batch(expr(self._frames))
  File "/home/hershd23/Desktop/evadb/evadb/udfs/abstract/abstract_udf.py", line 36, in __call__
    return self.forward(args[0])
  File "/home/hershd23/Desktop/evadb/evadb/udfs/ludwig.py", line 33, in forward
    predictions, _ = self.model.predict(frames, return_type=pd.DataFrame)
  File "/home/hershd23/Desktop/evadb/env/lib/python3.10/site-packages/ludwig/api.py", line 895, in predict
    predictions = predictor.batch_predict(
  File "/home/hershd23/Desktop/evadb/env/lib/python3.10/site-packages/ludwig/models/predictor.py", line 142, in batch_predict
    preds = self._predict(batch)
  File "/home/hershd23/Desktop/evadb/env/lib/python3.10/site-packages/ludwig/models/predictor.py", line 188, in _predict
    outputs = self._predict_on_inputs(inputs)
  File "/home/hershd23/Desktop/evadb/env/lib/python3.10/site-packages/ludwig/models/predictor.py", line 324, in _predict_on_inputs
    return self.dist_model(inputs)
  File "/home/hershd23/Desktop/evadb/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hershd23/Desktop/evadb/env/lib/python3.10/site-packages/ludwig/models/ecd.py", line 136, in forward
    combiner_outputs = self.combine(encoder_outputs)
  File "/home/hershd23/Desktop/evadb/env/lib/python3.10/site-packages/ludwig/models/ecd.py", line 81, in combine
    return self.combiner(encoder_outputs)
  File "/home/hershd23/Desktop/evadb/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hershd23/Desktop/evadb/env/lib/python3.10/site-packages/ludwig/combiners/combiners.py", line 451, in forward
    hidden, aggregated_mask, masks = self.tabnet(hidden)
  File "/home/hershd23/Desktop/evadb/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hershd23/Desktop/evadb/env/lib/python3.10/site-packages/ludwig/modules/tabnet_modules.py", line 113, in forward
    features = self.batch_norm(features)  # [b_s, i_s]
  File "/home/hershd23/Desktop/evadb/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hershd23/Desktop/evadb/env/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py", line 171, in forward
    return F.batch_norm(
  File "/home/hershd23/Desktop/evadb/env/lib/python3.10/site-packages/torch/nn/functional.py", line 2450, in batch_norm
    return torch.batch_norm(
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument weight in method wrapper_CUDA__native_batch_norm)

This could be due to my machine configuration; however, I was asked to report it for further analysis.
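For anyone trying to narrow this down: the error says the batch norm weight and its input live on different devices. A minimal sketch of a diagnostic that counts a model's parameters and buffers per device is below. The `device_report` helper and the toy model are hypothetical stand-ins, and how to reach the underlying `torch.nn.Module` from the Ludwig model object is not shown here.

```python
from collections import Counter
from itertools import chain

from torch import nn


def device_report(model: nn.Module) -> Counter:
    """Count how many parameters and buffers sit on each device."""
    counts = Counter()
    # Buffers matter too: BatchNorm keeps its running statistics as buffers.
    for _name, tensor in chain(model.named_parameters(), model.named_buffers()):
        counts[str(tensor.device)] += 1
    return counts


# Hypothetical stand-in for the real model; freshly built modules start on the CPU.
toy_model = nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8))
report = device_report(toy_model)
print(report)
```

If the report for the real model shows more than one device, that would point at the module that was not moved.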

Environment

  • Python 3.10
  • OS Ubuntu 22

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@github-actions
Contributor

github-actions bot commented Sep 1, 2023

👋 Hello @hershd23, thanks for your interest in EVA DB 🙏 Please visit our 🔮 Tutorials to get started, where you can find quickstart guides for simple tasks like Image Classification all the way to more interesting tasks like Emotion Analysis.

If this is a 🐞 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a ❓ Question, please provide as much information as possible, including dataset examples and query results.

@hershd23
Contributor Author

hershd23 commented Sep 1, 2023

Found this

https://stackoverflow.com/questions/66091226/runtimeerror-expected-all-tensors-to-be-on-the-same-device-but-found-at-least

Maybe a to_device()-style call (in PyTorch, .to(device)) is missing for the model when both a GPU and a CPU are available for training.
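To illustrate the kind of fix this suggests, here is a minimal PyTorch sketch that puts the model and its inputs on the same device before predicting. It is not the EvaDB or Ludwig code; `pick_device`, `predict_on_one_device`, and the toy model are assumptions made up for the example.

```python
import torch
from torch import nn


def pick_device() -> torch.device:
    """Use the GPU when one is visible, otherwise fall back to the CPU."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")


def predict_on_one_device(model: nn.Module, inputs: torch.Tensor) -> torch.Tensor:
    device = pick_device()
    # .to(device) moves parameters and buffers (including BatchNorm statistics).
    model = model.to(device)
    # Inputs must follow the model, otherwise the device mismatch error appears.
    inputs = inputs.to(device)
    model.eval()
    with torch.no_grad():
        return model(inputs)


# Hypothetical toy model containing a BatchNorm layer, as in the traceback.
model = nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8), nn.ReLU(), nn.Linear(8, 1))
out = predict_on_one_device(model, torch.randn(16, 4))
print(out.shape, out.device)
```

The same code runs on a CPU-only machine, which matches the observation that the failure only shows up when a GPU is present.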

@xzdandy
Collaborator

xzdandy commented Sep 2, 2023

The problem also exists on the ada-01 server. However, training works on a machine without a GPU. This is worth more investigation. Thanks, Hersh, for raising the issue.

@xzdandy xzdandy added Bug 🐞 EVA is not working as expected Crash 💥 EVA is crashing labels Sep 2, 2023
@xzdandy xzdandy added this to the v0.3.4 milestone Sep 2, 2023
@xzdandy
Collaborator

xzdandy commented Sep 5, 2023

The problem has been fixed on ada-01 with a new clean install: pip install ".[dev,ludwig,qdrant]". Hi @hershd23, could you verify whether the problem has also been fixed on your personal machine?

@hershd23
Contributor Author

hershd23 commented Sep 5, 2023

Yep checking

@hershd23
Contributor Author

hershd23 commented Sep 5, 2023

Hmm this still isn't resolved on my machine.

Steps I did

  • Pulled from latest staging
  • Installed the packages with the command you specified
  • Re ran the model training test.

It still fails with the same message

@xzdandy
Collaborator

xzdandy commented Sep 7, 2023

> Hmm this still isn't resolved on my machine.
>
> Steps I did
>
>   • Pulled from latest staging
>   • Installed the packages with the command you specified
>   • Re ran the model training test.
>
> It still fails with the same message

Could you post the output of pip freeze ?
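If the full pip freeze output is too long to paste, a shorter alternative is to print only the versions of the packages most likely to matter. This is a sketch using the standard library's `importlib.metadata`; the `installed_versions` helper and the choice of package names are assumptions.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names are a guess at what is relevant to this issue.
print(installed_versions(["torch", "ludwig", "evadb"]))
```

Comparing this output between the working and failing machines would show whether the two environments actually resolved to the same versions.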

@xzdandy xzdandy removed this from the v0.3.4 milestone Sep 7, 2023
@hershd23
Contributor Author

DMed you the output file

@xzdandy xzdandy added this to the v0.3.7 milestone Sep 22, 2023
@xzdandy xzdandy removed this from the v0.3.7 milestone Sep 30, 2023