Trying to run 3-state (2 spot states) HMM data - getting CUDA memory error and "Iteration started with a new seed" warnings #420

zhoudan-brandeis · 2023-02-14T22:24:57Z

Other details: I am running v1.1.17 and have previously run the same data successfully with a 2-state (1 spot state) HMM model.

I reduced the spot/frame batch sizes from 10 to 5 and from 512 to 256, but I still get many iterations of the warning (hundreds; the example image shows only the last few) before ultimately running out of CUDA memory.

ordabayevy · 2023-02-15T01:31:17Z

The program restarts the run when NaN values are detected in the parameters. This is usually fine if it happens only a small number of times over the entire run.

If it happens repeatedly, as in your case, then something is pathological. It is hard to tell whether it is related to the data or to the model without inspecting it closely. Can we set up a Zoom meeting to look at this together?
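For readers unfamiliar with this kind of restart logic, here is a minimal sketch of the general pattern: run the fit, and if any parameter becomes NaN, start again with a new seed. This is not Tapqir's implementation. The names `run_with_restarts`, `init`, and `step` are hypothetical, the sketch restarts from scratch for simplicity, and the `max_restarts` cap is an added assumption that turns hundreds of silent restarts into an explicit error.

```python
import math
import random


def has_nan(params):
    """Return True if any parameter value is NaN."""
    return any(math.isnan(v) for v in params.values())


def run_with_restarts(step, init, num_iters, seed=0, max_restarts=10):
    """Run `num_iters` steps, restarting with a new seed whenever NaN appears.

    `init(rng)` returns the initial parameter dict and `step(params, rng)`
    returns the updated one. Both are hypothetical callables, not Tapqir API.
    Returns (params, final_seed, number_of_restarts).
    """
    restarts = 0
    while True:
        rng = random.Random(seed)
        params = init(rng)
        for _ in range(num_iters):
            params = step(params, rng)
            if has_nan(params):
                break  # abandon this attempt
        else:
            # The loop finished without hitting NaN.
            return params, seed, restarts
        restarts += 1
        if restarts > max_restarts:
            # Assumption: fail loudly instead of restarting indefinitely.
            raise RuntimeError(f"NaN detected in {restarts} consecutive attempts")
        seed += 1  # "iteration started with a new seed"
```

With a cap like this, a run that keeps hitting NaN stops early with a clear error, which points to a structural problem rather than an unlucky seed.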

ordabayevy · 2023-02-15T01:31:51Z

I also see that it has run 50800 iterations. How close is it to convergence when you look at Tensorboard?

zhoudan-brandeis · 2023-02-15T01:33:38Z

I found a workaround that I think may give you a clue: I made a new directory and put the same data in it (driftlist, header, on/off spots). Since there wasn't a .tapqir folder there, it seems to be running smoothly (10% and counting).
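The workaround amounts to copying the input files into a fresh directory while leaving out the `.tapqir` folder. A minimal sketch using only the Python standard library follows; the function name `clone_analysis_dir` and the example paths are hypothetical, and it assumes the saved state lives in a folder named `.tapqir` inside the analysis directory, as described in this thread.

```python
import shutil
from pathlib import Path


def clone_analysis_dir(src, dst):
    """Copy `src` to a new directory `dst`, skipping any `.tapqir` folder.

    `dst` must not exist yet; shutil.copytree raises FileExistsError
    otherwise, which protects an existing analysis from being overwritten.
    """
    src, dst = Path(src), Path(dst)
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns(".tapqir"))
    return dst
```

For example, `clone_analysis_dir("analysis_2state", "analysis_3state")` would produce a directory with the same input data but no saved state from the earlier run.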


ordabayevy · 2023-02-15T03:56:33Z

Oh, I guess that is the reason. The name of the model file is the same for the 2-state and 3-state HMM models. Since you have already run the 2-state model, that one is saved in the .tapqir folder. Now, when you try to run the 3-state HMM, it loads the model file for the 2-state HMM and tries to continue from there. That's why it says iteration 50800. So running it in a different analysis folder should fix the problem.
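To illustrate the collision described above, here is a hedged sketch of two common ways a tool can avoid it: put the state count in the checkpoint file name, and check saved metadata before resuming. Nothing here is Tapqir's actual API or file layout. `checkpoint_path`, `check_resume`, the `.ckpt` name pattern, and the metadata dict are all hypothetical.

```python
from pathlib import Path


def checkpoint_path(workdir, model_name, num_states):
    """Build a checkpoint path that differs for each state count (hypothetical naming)."""
    return Path(workdir) / ".tapqir" / f"{model_name}_{num_states}state.ckpt"


def check_resume(saved_meta, model_name, num_states):
    """Decide whether a saved checkpoint may be resumed.

    `saved_meta` is the metadata dict stored with a checkpoint, or None
    when no checkpoint exists. Returns False for a fresh start, True when
    the checkpoint matches, and raises ValueError on a mismatch.
    """
    if saved_meta is None:
        return False  # nothing saved: start from iteration 0
    requested = {"model": model_name, "num_states": num_states}
    if saved_meta != requested:
        raise ValueError(
            f"checkpoint was saved for {saved_meta}, but {requested} was requested"
        )
    return True
```

Under either scheme, starting a 3-state run in a folder that holds a 2-state checkpoint would begin from scratch or stop with a clear error, rather than continuing from iteration 50800 of the wrong model.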
