Issue for getting embedding #10

Open
zxshi0102 opened this issue Oct 29, 2024 · 3 comments

Comments

@zxshi0102

zxshi0102 commented Oct 29, 2024

Hi there,
I am trying to compute the embedding for my antibody sequence with ABodyBuilder3-LM:

import torch

from abodybuilder3.language.model import ProtT5

# heavy and light are the heavy- and light-chain amino acid sequences defined earlier
use_precomputed = False
if use_precomputed:
    embedding = torch.load("/home/user/package_download/abodybuilder3/data/structures/structures_plm/6yio_H0-L0.pt")[
        "plm_embedding"
    ]
else:
    plm = ProtT5()
    embedding = plm.get_embeddings(
        [
            heavy,
        ],
        [
            light,
        ],
    )

Running this code raises the following error:

RuntimeError                              Traceback (most recent call last)
Cell In[4], line 8
      4     embedding = torch.load("/home/user/package_download/abodybuilder3/data/structures/structures_plm/6yio_H0-L0.pt")[
      5         "plm_embedding"
      6     ]
      7 else:
----> 8     plm = ProtT5()
      9     embedding = plm.get_embeddings(
     10         [
     11             heavy,
   (...)
     15         ],
     16     )
     18 print(f"{embedding.shape=}")

File <string>:8, in __init__(self, weights_dir, model_type, paired, batch_size, device_map)

File ~/package_download/abodybuilder3/src/abodybuilder3/language/model.py:55, in ProtTrans.__post_init__(self)
     52 elif self.paired and self.model_type == "t5":
     53     self.seperator_token = "</s>"
---> 55 self.trainer = Trainer(num_nodes=1, devices=1)

File ~/package_download/abodybuilder3/.venv/lib/python3.9/site-packages/lightning/pytorch/utilities/argparse.py:70, in _defaults_from_env_vars.<locals>.insert_env_defaults(self, *args, **kwargs)
     67 kwargs = dict(list(env_variables.items()) + list(kwargs.items()))
     69 # all args were already moved to kwargs
---> 70 return fn(self, **kwargs)

File ~/package_download/abodybuilder3/.venv/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py:401, in Trainer.__init__(self, accelerator, strategy, devices, num_nodes, precision, logger, callbacks, fast_dev_run, max_epochs, min_epochs, max_steps, min_steps, max_time, limit_train_batches, limit_val_batches, limit_test_batches, limit_predict_batches, overfit_batches, val_check_interval, check_val_every_n_epoch, num_sanity_val_steps, log_every_n_steps, enable_checkpointing, enable_progress_bar, enable_model_summary, accumulate_grad_batches, gradient_clip_val, gradient_clip_algorithm, deterministic, benchmark, inference_mode, use_distributed_sampler, profiler, detect_anomaly, barebones, plugins, sync_batchnorm, reload_dataloaders_every_n_epochs, default_root_dir)
    398 # init connectors
    399 self._data_connector = _DataConnector(self)
--> 401 self._accelerator_connector = _AcceleratorConnector(
    402     devices=devices,
    403     accelerator=accelerator,
    404     strategy=strategy,
    405     num_nodes=num_nodes,
    406     sync_batchnorm=sync_batchnorm,
    407     benchmark=benchmark,
    408     use_distributed_sampler=use_distributed_sampler,
    409     deterministic=deterministic,
    410     precision=precision,
    411     plugins=plugins,
    412 )
    413 self._logger_connector = _LoggerConnector(self)
    414 self._callback_connector = _CallbackConnector(self)

File ~/package_download/abodybuilder3/.venv/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/accelerator_connector.py:158, in _AcceleratorConnector.__init__(self, devices, num_nodes, accelerator, strategy, plugins, precision, sync_batchnorm, benchmark, use_distributed_sampler, deterministic)
    155 self._set_parallel_devices_and_init_accelerator()
    157 # 3. Instantiate ClusterEnvironment
--> 158 self.cluster_environment: ClusterEnvironment = self._choose_and_init_cluster_environment()
    160 # 4. Instantiate Strategy - Part 1
    161 if self._strategy_flag == "auto":

File ~/package_download/abodybuilder3/.venv/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/accelerator_connector.py:428, in _AcceleratorConnector._choose_and_init_cluster_environment(self)
    420 for env_type in (
    421     # TorchElastic has the highest priority since it can also be used inside SLURM
    422     TorchElasticEnvironment,
   (...)
    425     MPIEnvironment,
    426 ):
    427     if env_type.detect():
--> 428         return env_type()
    429 if _LIGHTNING_BAGUA_AVAILABLE:
    430     from lightning_bagua import BaguaEnvironment

File ~/package_download/abodybuilder3/.venv/lib/python3.9/site-packages/lightning/fabric/plugins/environments/slurm.py:52, in SLURMEnvironment.__init__(self, auto_requeue, requeue_signal)
     50 self.requeue_signal = requeue_signal
     51 self._validate_srun_used()
---> 52 self._validate_srun_variables()

File ~/package_download/abodybuilder3/.venv/lib/python3.9/site-packages/lightning/fabric/plugins/environments/slurm.py:210, in SLURMEnvironment._validate_srun_variables()
    208 print(os.environ)
    209 if ntasks > 1 and "SLURM_NTASKS_PER_NODE" not in os.environ:
--> 210     raise RuntimeError(
    211         f"You set `--ntasks={ntasks}` in your SLURM bash script, but this variable is not supported."
    212         f" HINT: Use `--ntasks-per-node={ntasks}` instead."
    213     )

RuntimeError: You set `--ntasks=64` in your SLURM bash script, but this variable is not supported. HINT: Use `--ntasks-per-node=64` instead.

I tried changing all occurrences of ntasks to ntasks-per-node in abodybuilder3/.venv/lib/python3.9/site-packages/lightning/fabric/plugins/environments/slurm.py, but that broke the package so it could no longer be imported at all. I cannot find another way to fix this; could you give me some hints? Thanks!
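One idea I have not tried yet, just a sketch assuming the check above only looks at environment variables, would be to mirror SLURM_NTASKS into SLURM_NTASKS_PER_NODE inside the notebook before constructing ProtT5:

import os

# Untested workaround sketch: the Lightning check only raises when SLURM_NTASKS > 1
# and SLURM_NTASKS_PER_NODE is missing, so mirroring the value should satisfy it.
if "SLURM_NTASKS" in os.environ and "SLURM_NTASKS_PER_NODE" not in os.environ:
    os.environ["SLURM_NTASKS_PER_NODE"] = os.environ["SLURM_NTASKS"]

Would something like this be reasonable, or is there a cleaner way?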

@henrykenlay
Contributor

Hi @zxshi0102, the error suggests you are using Slurm to launch a script or an interactive session to run this code, and that you specified --ntasks instead of --ntasks-per-node in the Slurm script. Can you check your Slurm script for these options?

To isolate the issue, could you try importing and initialising ProtT5 from abodybuilder3/language/model.py?
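For example, something along these lines (same import path as in your traceback):

from abodybuilder3.language.model import ProtT5

# Initialise the protein language model wrapper on its own,
# independent of the rest of the notebook code
plm = ProtT5()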

@zxshi0102
Author

Thank you for your prompt response! I am indeed using an interactive session via Jupyter notebook. The error shows up only when trying to compute the embeddings; when running the prediction with the full model, the command works and I am able to produce the PDB outputs.

I have tried to import and initialize the model as suggested with:

import sys

sys.path.append("/home/zshi/package_download/abodybuilder3")
from abodybuilder3.language.model import ProtT5

plm = ProtT5()  # initialisation is where the error is raised

Unfortunately, the same error shows up. Thanks, and let me know if there is something else I should try; otherwise I will default to using the full model.
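If it helps, I can also dump the SLURM-related variables the notebook kernel sees, with a small snippet like this:

import os

# List the SLURM_* variables inherited by the notebook kernel; the failing check in
# lightning/fabric/plugins/environments/slurm.py reads these from os.environ.
print({k: v for k, v in os.environ.items() if k.startswith("SLURM_")})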

@henrykenlay
Contributor

I am indeed using an interactive session via jupyter notebook

Sorry, I meant an interactive session using Slurm. Your error message seems to suggest you are using Slurm; is this the case?

It seems your issue is coming from self.trainer = Trainer(num_nodes=1, devices=1). Can you try:

from lightning import Trainer
Trainer(num_nodes=1, devices=1)

and see if you get the same error? Can you also try:

from lightning import Trainer
Trainer()
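If the bare Trainer() also raises, one quick diagnostic (not a fix) is to check whether Lightning's SLURM detection is triggering in your kernel:

from lightning.fabric.plugins.environments import SLURMEnvironment

# True means Lightning thinks this process is inside a SLURM job, in which case it
# runs the srun/ntasks validation that produces your RuntimeError.
print(SLURMEnvironment.detect())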
