scPRINT is hanging forever #3
The code that the model runs is FlashAttention-2. It is not a dependency but part of the model; scPRINT implements it through Triton. I have never tested scPRINT on CUDA 11.4. Finally, to test it, you should set the input context to 200 and the minibatch size to 1 and check what happens; maybe it is not using the GPU. (These are parameters of the Embedder class.) To make sure that this is due to Triton, you can run the model with regular attention.
Hi @jkobject, it seems like the GPU is being used, according to the output of calling Embedder:

```python
>>> embedder = Embedder(
...     # can work on random genes or most variable genes, etc.
...     how="random expr",
...     # number of genes to use
...     max_len=4000,
...     add_zero_genes=0,
...     # for the dataloading
...     num_workers=8,
...     # we will only use the cell type embedding here
...     pred_embedding=["cell_type_ontology_term_id"]
... )  # , "disease_ontology_term_id"]
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
```

I also just ran:

```python
>>> adata, metrics = embedder(model, adata, cache=False, output_expression="none")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Embedder.__call__() got an unexpected keyword argument 'output_expression'
```
Ok, I ran it with pdb, and the error message is:

```python
(Pdb) embedder = Embedder(how="random expr", batch_size=1, max_len=200, add_zero_genes=0, num_workers=1, pred_embedding=["cell_type_ontology_term_id"])
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
(Pdb) adata, metrics = embedder(model, adata, cache=False)
  0%|          | 0/83451 [00:00<?, ?it/s]
/tmp/tmpitjn_6yl/main.c:6:23: fatal error: stdatomic.h: No such file or directory
 #include <stdatomic.h>
                       ^
compilation terminated.
  0%|          | 0/83451 [00:00<?, ?it/s]
*** subprocess.CalledProcessError: Command '['/bin/gcc', '/tmp/tmpitjn_6yl/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmpitjn_6yl/cuda_utils.cpython-310-x86_64-linux-gnu.so', '-lcuda', '-L/directflow/SCCGGroupShare/projects/kavkri/.conda/envs/scprint-env/lib/python3.10/site-packages/triton/backends/nvidia/lib', '-L/lib64', '-L/lib', '-I/directflow/SCCGGroupShare/projects/kavkri/.conda/envs/scprint-env/lib/python3.10/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpitjn_6yl', '-I/directflow/SCCGGroupShare/projects/kavkri/.conda/envs/scprint-env/include/python3.10']' returned non-zero exit status 1.
```
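(Editor's note: the `stdatomic.h` failure above usually means the C compiler that Triton's JIT picks up is too old; `<stdatomic.h>` is a C11 header first shipped in gcc 4.9, and older cluster toolchains such as gcc 4.8 lack it. The stdlib-only sketch below, which is not part of scPRINT, reproduces Triton's compile step outside the model so the toolchain can be tested directly; the `CC` lookup mirrors the environment variable some Triton versions honor.)

```python
# Probe whether the C compiler can compile a file that includes <stdatomic.h>
# (the header missing in the traceback above). Python stdlib only.
import os
import shutil
import subprocess
import tempfile


def probe_stdatomic(cc=None):
    """Compile a one-line C program including <stdatomic.h>; return (ok, stderr)."""
    cc = cc or os.environ.get("CC") or shutil.which("gcc") or shutil.which("cc")
    if cc is None:
        return False, "no C compiler found on PATH"
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "probe.c")
        with open(src, "w") as f:
            f.write("#include <stdatomic.h>\nint main(void) { return 0; }\n")
        res = subprocess.run(
            [cc, src, "-o", os.path.join(tmp, "probe")],
            capture_output=True, text=True,
        )
    return res.returncode == 0, res.stderr


ok, err = probe_stdatomic()
print("stdatomic.h OK" if ok else "toolchain problem:\n" + err)
```

If this probe fails, pointing `CC` at a newer gcc (for instance one installed into the conda env) before launching Python is a plausible workaround to try.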
For the first message, this is because `output_expression` is an argument of the Embedder constructor, not of the call itself:

```python
>>> embedder = Embedder(
...     # can work on random genes or most variable genes, etc.
...     how="random expr",
...     # number of genes to use
...     max_len=4000,
...     add_zero_genes=0,
...     # for the dataloading
...     num_workers=8,
...     # we will only use the cell type embedding here
...     pred_embedding=["cell_type_ontology_term_id"],
...     output_expression="none"  # default value, can be dropped
... )
```
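(Editor's note: the constructor-versus-call distinction can be seen with a self-contained mock class; this is an illustration, not the real scPRINT `Embedder`.)

```python
# Mock illustration: a keyword accepted by __init__ but not by __call__
# raises exactly the kind of TypeError seen in the traceback above.
class MockEmbedder:
    def __init__(self, output_expression="none"):
        self.output_expression = output_expression  # constructor-only option

    def __call__(self, model, adata, cache=True):
        # __call__ has no output_expression parameter
        return adata, {"output_expression": self.output_expression}


emb = MockEmbedder(output_expression="none")      # correct place for the option
_, metrics = emb("model", "adata", cache=False)   # works

try:
    emb("model", "adata", output_expression="none")  # wrong place
except TypeError as e:
    print(e)
```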
For the second part, the error is quite cryptic. Seeing "GPU ... used: True" just means the model will try to use the GPU, not that it succeeded in doing so. By using pdb, I wanted to know at which line exactly in the Embedder's `__call__` function the code hangs. Have you tried running a test version without flash attention, as I mentioned in my second comment? I am now seeing your new error here.
I will mark this issue as closed since I received no replies in the past month. |
Hi jkobject, I am facing the same issue from the
I believe that it comes from the
I confirm that I can run |
Hey, thanks for the update and the information! Sorry not to have replied earlier; I was on vacation for a week. Right, so first: Triton and CUDA can be a pain to work with, but it is still better than installing flash attention from scratch. I remember I could initially make it work on pytorch 2.0 with cu11.7 and triton 2.0.0.mlir-something, the pre-MLIR version, which was causing an issue with the flash attention version of Triton. First things first, CUDA 11.7 is starting to be quite old now, and you will miss out on many pytorch features if you don't update to at least 11.8. Otherwise, you will have to try to make it work as is: first, you might want to use triton 2.0.0 instead of 2.0.1; look at Triton and try to run small Triton kernels; then try to use the flash attention Triton kernel and see what causes a bug. Last time I tried, it was working... If I can't make it work, I will document that scPRINT is only compatible with CUDA >= 11.8.
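(Editor's note: before debugging kernels, it helps to record exactly which torch/triton/CUDA combination the environment actually has. A small stdlib-based sketch; the package names are read from installed metadata, so it works even when an import itself would fail.)

```python
# Report installed torch/triton versions without importing them first
# (importlib.metadata reads the installed package metadata directly).
from importlib.metadata import PackageNotFoundError, version

report = {}
for pkg in ("torch", "triton"):
    try:
        report[pkg] = version(pkg)
    except PackageNotFoundError:
        report[pkg] = "not installed"

for pkg, ver in report.items():
    print(f"{pkg}: {ver}")

# If torch is importable, torch.version.cuda is the CUDA toolkit the wheel
# was built against (e.g. '11.8'), which is what the compatibility advice
# above refers to.
try:
    import torch
    print("built for CUDA:", torch.version.cuda)
    print("GPU available:", torch.cuda.is_available())
except ImportError:
    pass
```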
I downloaded the checkpoints from hugging face and loaded them. I am up to the embedder step in this tutorial https://github.com/jkobject/scPRINT/blob/main/docs/notebooks/cancer_usecase.ipynb
I first ran
Then I ran it with the "output_expression" parameter removed. However, it stops and automatically quits my python terminal (I am running python interactively inside a conda env). I am wondering if this is a memory issue (currently using 1 GPU with 128 GB). Should I try increasing the memory?
Originally posted by @kavithakrishna1 in jkobject#9 (comment)