scPRINT is hanging forever #3

jkobject · 2024-09-19T12:56:38Z

I downloaded the checkpoints from hugging face and loaded them. I am up to the embedder step in this tutorial https://github.com/jkobject/scPRINT/blob/main/docs/notebooks/cancer_usecase.ipynb

I first ran

adata, metrics = embedder(model, adata, cache=False, output_expression="none")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Embedder.__call__() got an unexpected keyword argument 'output_expression'

Then I ran it with the "output_expression" parameter removed. However, it stops and automatically quits my Python terminal (I am running Python interactively inside a conda env). I am wondering if this is a memory issue (currently using 1 GPU with 128GB). Should I try increasing the memory?

adata, metrics = embedder(model, adata, cache=False)
0%|                                                           | 0/1304 [00:00<?, ?it/s] 
(quits python terminal here)

Originally posted by @kavithakrishna1 in jkobject#9 (comment)

jkobject · 2024-09-19T12:56:42Z

The code that the model is running is flash attention 2. It is not a dependency but part of the model. scPRINT does it through triton. I have never tested scPRINT on 11.4.
So, you would have to use pdb and check whether the model.predict() function gets called within the Embedder class. Also, can you check whether the GPU memory gets used?

Finally, to test it, you should set the input context to 200 and the minibatch size to 1 and check what happens. Maybe it is not using the GPU. (These are parameters of the Embedder class.)
e.g. embedder = Embedder(batch_size=1, num_workers=1, max_len=200), and maybe use an adata of only a couple of cells.

To make sure that this is due to triton, you can run the model with regular attention by doing:
model = scPrint.load_from_checkpoint(ckpt_path, precpt_gene_emb=None, transformer="normal")
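
Since the interactive session exits without a traceback, it can help to run the small test in a separate Python process, so that the exit code or killing signal is recorded instead of being lost with the terminal. A minimal sketch using only the standard library; `run_isolated` is a hypothetical helper name, and the `code` string is whatever short script you want to try (for example, one that builds the Embedder with batch_size=1 and max_len=200 on a few cells):

```python
import subprocess
import sys


def run_isolated(code: str, timeout: float = 600.0) -> str:
    """Run `code` in a fresh Python process and describe how it ended."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The child was still running after `timeout` seconds (a hang).
        return "timeout"
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child died from a signal
        # (e.g. 11 for a segmentation fault, 9 if it was killed).
        return f"killed by signal {-proc.returncode}"
    return f"exit code {proc.returncode}"


if __name__ == "__main__":
    print(run_isolated("print('hello from the child process')"))
```

A "timeout" result points to a hang, while "killed by signal ..." points to a crash in native code or to the process being killed from outside.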

kavithakrishna1 · 2024-09-23T02:43:13Z

Hi @jkobject

It seems like GPU is being used according to the outputs by calling Embedder.

>>> embedder = Embedder( 
...                     # can work on random genes or most variables etc..
...                     how="random expr", 
...                     # number of genes to use
...                     max_len=4000, 
...                     add_zero_genes=0, 
...                     # for the dataloading
...                     num_workers=8, 
...                     # we will only use the cell type embedding here.
...                     pred_embedding = ["cell_type_ontology_term_id"]
...                     )#, "disease_ontology_term_id"])
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

I also just ran pip install --upgrade scprint and checked the output_expression parameter; however, it didn't work. I also tried pip install scprint[dev].

adata, metrics = embedder(model, adata, cache=False, output_expression="none")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Embedder.__call__() got an unexpected keyword argument 'output_expression'

kavithakrishna1 · 2024-09-23T03:49:35Z

OK, I ran it with pdb. The error message is /tmp/tmpitjn_6yl/main.c:6:23: fatal error: stdatomic.h: No such file or directory. I am not sure where this is coming from. Any ideas?

(Pdb) embedder = Embedder(how="random expr", batch_size=1, max_len=200, add_zero_genes=0, num_workers=1, pred_embedding = ["cell_type_ontology_term_id"])
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
(Pdb) adata, metrics = embedder(model, adata, cache=False)
  0%|                                                       | 0/83451 [00:00<?, ?it/s]
/tmp/tmpitjn_6yl/main.c:6:23: fatal error: stdatomic.h: No such file or directory
 #include <stdatomic.h>
                       ^
compilation terminated.
  0%|                                                       | 0/83451 [00:00<?, ?it/s]         
*** subprocess.CalledProcessError: Command '['/bin/gcc', '/tmp/tmpitjn_6yl/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmpitjn_6yl/cuda_utils.cpython-310-x86_64-linux-gnu.so', '-lcuda', '-L/directflow/SCCGGroupShare/projects/kavkri/.conda/envs/scprint-env/lib/python3.10/site-packages/triton/backends/nvidia/lib', '-L/lib64', '-L/lib', '-I/directflow/SCCGGroupShare/projects/kavkri/.conda/envs/scprint-env/lib/python3.10/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpitjn_6yl', '-I/directflow/SCCGGroupShare/projects/kavkri/.conda/envs/scprint-env/include/python3.10']' returned non-zero exit status 1.

jkobject · 2024-09-23T12:38:31Z

For the first message: this is because output_expression is now part of the class init function, so you need to do

>>> embedder = Embedder( 
...                     # can work on random genes or most variables etc..
...                     how="random expr", 
...                     # number of genes to use
...                     max_len=4000, 
...                     add_zero_genes=0, 
...                     # for the dataloading
...                     num_workers=8, 
...                     # we will only use the cell type embedding here.
...                     pred_embedding = ["cell_type_ontology_term_id"],
...                     output_expression="none" #default value, can be dropped
...                     )

jkobject · 2024-09-23T12:43:42Z

For the second part, the error is quite cryptic. Seeing "GPU available: True (cuda), used: True" just means the model will try to use the GPU, not that it succeeded in doing so. Commands like nvtop will show you GPU usage in real time.
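
If nvtop is not installed, the same information can be polled with nvidia-smi. A minimal sketch, assuming nvidia-smi is on the PATH; the helper names `parse_gpu_memory` and `gpu_memory` are made up for illustration. It reports per-GPU memory in MiB, so you can compare the numbers before and while the embedder runs:

```python
import shutil
import subprocess

# Standard nvidia-smi query flags; values are reported in MiB without units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list[tuple[int, int, int]]:
    """Parse lines like '0, 512, 81920' into (index, used_mib, total_mib)."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def gpu_memory() -> list[tuple[int, int, int]]:
    """Query nvidia-smi once; return an empty list if it is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    for index, used, total in gpu_memory():
        print(f"GPU {index}: {used} / {total} MiB used")
```

If the used memory stays near zero while the progress bar sits at 0%, the model has probably not reached the GPU at all.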

By using pdb, I wanted to know at which line exactly in the Embedder.__call__ function the code hangs.
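
An alternative to stepping through with pdb is Python's built-in faulthandler module, which can dump the Python traceback of every thread after a delay or on a fatal signal. A minimal sketch; `watch_for_hang` and `stop_watching` are hypothetical helper names, and the log path and the 120 second delay are arbitrary choices:

```python
import faulthandler


def watch_for_hang(log_path: str, seconds: float = 120.0):
    """Write tracebacks to `log_path` on a crash, and every `seconds` while running."""
    log_file = open(log_path, "w")
    # Dump tracebacks if the interpreter receives a fatal signal (e.g. SIGSEGV).
    faulthandler.enable(file=log_file)
    # Dump tracebacks of all threads every `seconds` until cancelled.
    faulthandler.dump_traceback_later(seconds, repeat=True, file=log_file)
    return log_file  # must stay open while the watchdog is active


def stop_watching(log_file) -> None:
    """Cancel the periodic dump, disable the fault handler and close the log."""
    faulthandler.cancel_dump_traceback_later()
    faulthandler.disable()
    log_file.close()
```

Call `watch_for_hang("hang.log")` before `embedder(model, adata, cache=False)`. If the call never returns, the log shows the Python line each thread is on; when the hang is inside native Triton or CUDA code, that is the Python line that called into it.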

Have you tried running a test version without flashattention, as I mentioned in my second comment?

Seeing your new error here *** subprocess.CalledProcessError: Command '['/bin/gcc', '/tmp/tmpitjn_6yl/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmpitjn_6yl/cuda_utils.cpython-310-x86_64-linux-gnu.so', '-lcuda', '-L/directflow/SCCGGroupShare/projects/kavkri/.conda/envs/scprint-env/lib/python3.10/site-packages/triton/backends/nvidia/lib', '-L/lib64', '-L/lib', '-I/directflow/SCCGGroupShare/projects/kavkri/.conda/envs/scprint-env/lib/python3.10/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpitjn_6yl', '-I/directflow/SCCGGroupShare/projects/kavkri/.conda/envs/scprint-env/include/python3.10']' returned non-zero exit status 1.
My guess is that you have a problem with your pytorch / GPU / cuda installation and that it is not related to scPRINT, but I might be wrong. Have you used pytorch with your GPU before?
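
The failing command in that error is triton calling /bin/gcc on a generated main.c, and the compiler cannot find stdatomic.h. That header is part of C11 and, as far as I know, GCC only ships it from version 4.9 onwards, so an old system compiler is one plausible cause. A minimal sketch to check this outside of scPRINT; the helper names are made up for illustration:

```python
import re
import shutil
import subprocess
import tempfile
from pathlib import Path


def parse_gcc_version(text: str):
    """Turn the output of `gcc -dumpversion` ('4.8.5', '11', ...) into (major, minor)."""
    match = re.match(r"\s*(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2) or 0)


def has_stdatomic(compiler: str = "gcc") -> bool:
    """Try to compile a tiny C file that includes <stdatomic.h>."""
    path = shutil.which(compiler)
    if path is None:
        return False
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "check.c"
        source.write_text("#include <stdatomic.h>\nint main(void) { return 0; }\n")
        result = subprocess.run(
            [path, "-c", str(source), "-o", str(Path(tmp) / "check.o")],
            capture_output=True,
            text=True,
        )
    return result.returncode == 0


if __name__ == "__main__":
    gcc = shutil.which("gcc")
    if gcc is not None:
        out = subprocess.run([gcc, "-dumpversion"], capture_output=True, text=True)
        print("gcc version:", parse_gcc_version(out.stdout))
    print("stdatomic.h available:", has_stdatomic("gcc"))
```

If this prints False, installing a newer gcc inside the conda env is one thing to try. I believe triton's build step respects the CC environment variable when choosing a compiler, but treat that as an assumption to verify against your triton version.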

jkobject · 2024-10-21T07:50:19Z

I will mark this issue as closed since I received no replies in the past month.

fantashi099 · 2024-11-09T05:28:16Z

Hi jkobject,

I am facing the same issue from the embedder here:

adata, metrics = embedder(model, adata, cache=False)
0%|                                                           | 0/1304 [00:00<?, ?it/s] 
(I cannot quit python terminal, I can only kill the python process)

I believe that it comes from triton with flashattention, because the process gets stuck or dies after running this:

transformer_output = self.transformer(
            encoding,
            return_qkv=get_attention_layer,
            bias=bias if self.attn_bias != "none" else None,
            bias_layer=list(range(self.nlayers - 1)),
        )

I confirm that I can run embedder normally with transformer="normal". For some reason, I can only use the CUDA driver 11.7, so my pytorch version is 2.0.1-cu11.7 with triton 2.0.0.
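
For reports like this one, the installed versions can be collected in one go with the standard library. A minimal sketch; `installed_versions` is a hypothetical helper, and the package names passed to it are assumed to be the PyPI distribution names:

```python
from importlib import metadata


def installed_versions(packages=("torch", "triton", "scprint")) -> dict:
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```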

jkobject · 2024-11-13T16:05:19Z

Hey, thanks for the update and information!

Sorry not to have replied earlier, I was on vacation for a week.

Right, so first, triton and cuda can be a pain to work with, but it is better than installing flashattention from scratch.

So I remember I could make it work initially on pytorch 2.0 with cu11.7 and triton 2.0.0.mlir-something, which was the version before mlir, which was causing an issue with the flashattention version of triton.
However, this version seems to have disappeared from PyPI.

First things first: CUDA 11.7 is starting to be quite old now, and you will miss out on many pytorch features if you don't update to at least 11.8.

Otherwise, you will have to try to make it work as is. First, you might want to use 2.0.0 instead of 2.0.1, look at triton, and try to run small triton kernels. Then try to use the flashattention triton kernel and see what causes a bug.
Make sure you have the latest version of scPRINT on the main branch. You can disassemble my embedder class too to understand this issue better. I will try to take the time to re-install cuda 11.7 and scPRINT on it.
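
As an example of a small triton kernel to start with, here is a sketch of an element-wise vector addition, modelled on the introductory triton tutorial. It is not part of scPRINT, and `triton_smoke_test` is a hypothetical helper name. It returns None when triton or a CUDA GPU is not available, so it is safe to run anywhere:

```python
try:
    import torch
    import triton
    import triton.language as tl
except ImportError:
    torch = triton = tl = None

if triton is not None:

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program instance handles one block of BLOCK_SIZE elements.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements  # guard the last, partially filled block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)


def triton_smoke_test(n: int = 4096):
    """Return True/False for kernel correctness, or None if it cannot run here."""
    if triton is None or not torch.cuda.is_available():
        return None
    x = torch.rand(n, device="cuda")
    y = torch.rand(n, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(n, 1024),)  # number of program instances to launch
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    torch.cuda.synchronize()  # wait for the kernel so errors surface here
    return bool(torch.allclose(out, x + y))


if __name__ == "__main__":
    print("triton smoke test:", triton_smoke_test())
```

If this already hangs or fails to compile, the problem is in the triton / CUDA / compiler setup rather than in the flashattention kernel or in scPRINT.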

Last time I had it, it was working... If I can't, I will write that scPRINT is only compatible with cuda >= 11.8.

jkobject added the bug and help wanted labels Sep 19, 2024

jkobject closed this as completed Sep 19, 2024

jkobject reopened this Sep 19, 2024

jkobject mentioned this issue Sep 19, 2024

scPRINT is hanging forever jkobject/scPRINT#10

Closed

jkobject closed this as completed Oct 22, 2024

jkobject reopened this Nov 13, 2024
