Error in hclust(parallelDist::parDist(t(CNA_mtx), threads = par_cores, : size cannot be NA nor exceed 65536 #69

Open
Ilarius opened this issue Jun 21, 2023 · 7 comments

Ilarius commented Jun 21, 2023

Hello, if I try to run this in parallel on a cluster with Slurm I get a *** caught bus error ***, even if I allocate enough memory.

I tried with just one core, but I get the following error:

results <- SCEVAN::multiSampleComparisonClonalCN(listCountMtx, analysisName = "ovarian", organism = "human" , par_cores = 1, plotTree = TRUE)

[1] " raw data - genes: 36601 cells: 71634"
[1] "1) Filter: cells > 200 genes"
[1] "low data quality"
[1] "2) Filter: genes > 5% of cells"
[1] "8286 genes past filtering"
[1] "3) Annotations gene coordinates"

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Loading required package: doParallel
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
[1] "found 30 confident non malignant cells"
[1] "7537 genes annotated"
[1] "4) Filter: genes involved in the cell cycle"
[1] "7123 genes past filtering "
[1] "5)  Filter: cells > 5genes per chromosome "
[1] "6) Log Freeman Turkey transformation"
[1] "A total of 67300 cells, 7123 genes after preprocessing"
[1] "7) Measuring baselines (confident normal cells)"
[1] "8) Smoothing data"
[1] "9) Segmentation (VegaMC)"
[1] "10) Adjust baseline"
Error in hclust(parallelDist::parDist(t(CNA_mtx), threads = par_cores,  : 
  size cannot be NA nor exceed 65536
Calls: <Anonymous> ... lapply -> FUN -> pipelineCNA -> classifyTumorCells -> hclust
Execution halted

Any clues?
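
For context on the error above: base R's `hclust` rejects inputs with more than 65536 observations, and the log reports 67300 cells after preprocessing, which is over that limit. Below is a minimal sketch, in base R only, of a pre-check that randomly subsamples cell columns before the matrix is handed to the pipeline. `cap_cells`, `HCLUST_MAX`, and the toy matrix are hypothetical names for illustration and are not part of SCEVAN; whether random subsampling is acceptable depends on the analysis.

```r
# Hypothetical helper (not part of SCEVAN): keep at most `max_cells` columns
# of a genes x cells count matrix so that hclust() stays within its size limit.
HCLUST_MAX <- 65536L  # limit quoted in the error message above

cap_cells <- function(count_mtx, max_cells = HCLUST_MAX, seed = 1) {
  n_cells <- ncol(count_mtx)
  if (n_cells <= max_cells) {
    return(count_mtx)  # already small enough, return unchanged
  }
  set.seed(seed)  # make the subsample reproducible
  keep <- sort(sample.int(n_cells, max_cells))  # preserve column order
  count_mtx[, keep, drop = FALSE]
}

# Toy example: 5 genes x 20 cells, capped at 8 cells
toy <- matrix(rpois(100, lambda = 2), nrow = 5,
              dimnames = list(paste0("gene", 1:5), paste0("cell", 1:20)))
capped <- cap_cells(toy, max_cells = 8)
dim(capped)
```

Because the pipeline's own filters only remove cells, capping the input at the limit keeps the number of clustered cells at or below it.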

@AntonioDeFalco
Owner

Hi @Ilarius,
Help me understand what kind of data this happens with. I see that you are using the multi-sample analysis, but one of the individual samples in your listCountMtx has 71634 initial cells. How come?

@Ilarius
Author

Ilarius commented Jun 22, 2023

It's ovarian cancer: the first sample has 71634 initial cells and the second one 73644. That's because I only load a matrix of cells with at least 200 features; otherwise I would have to allocate a 1 TB vector in R!

In the end I used the final filtered matrices (roughly 10k cells each) and the same code worked. The cells that I thought more likely to be tumoral (given some markers) are enriched among the cells your algorithm calls "tumoral". However, a significant proportion of blood cells (a minority of the cells in the experiment, which should not be aneuploid) is also detected as tumoral, and this makes the results less reliable. Do you think using the filtered matrix could have caused this problem? How important is it to start with the unfiltered matrix?

@AntonioDeFalco
Owner

I believe that using the filtered matrix is the correct procedure. To check for incorrectly classified cells, you can view the heatmap to see whether the separation was done correctly. Some errors can be caused by cells with a noisier signal. You can improve the final result by passing SCEVAN more cells that you are confident are normal, via the norm_cells parameter.

Regards

@Ilarius
Author

Ilarius commented Jul 21, 2023

I did not use norm_cells because the documentation says: "norm_cells : Vector of normal cells if the classification is already known and you are only interested in the clonal structure (optional)".

So I know that since it is a solid tumor the tumoral cells are in the epithelial cluster, and not in the blood cell clusters.

PS. Is the code that you use for the heatmaps and other visualizations shown in this vignette available somewhere?

http://htmlpreview.github.io/?https://github.com/AntonioDeFalco/SCEVAN/blob/main/vignettes/IntratumoralHeterogeneityInGlioblastoma.html

@AntonioDeFalco
Owner

If the count matrix contains cells that you are confident are normal, you can pass them as the norm_cells parameter; they will be used to create a reference and identify all diploid cells.

All the code is public; you can find it in this GitHub repository.

@ahdee

ahdee commented Jan 12, 2024

@Ilarius just a random idea while reading through this: what about using your cell annotations (blood cells) as a source of "normal" cells? Maybe set a seed and randomly draw 2-3k cells? You mention that these cells most likely should not be cancerous. Going even further, perhaps select only blood cells in a certain cell cycle phase and/or with low expression of genes particular to the cancer type you are looking at?
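
The suggestion above (set a seed, then draw a few thousand annotated blood cells as presumed normal cells) could be sketched in base R as follows. The annotation table `annot`, its column names, and the helper `pick_normal_cells` are hypothetical and only illustrate the idea; they are not part of SCEVAN.

```r
# Hypothetical annotation table: one row per cell barcode with a cell-type label.
annot <- data.frame(
  barcode = paste0("cell", 1:10000),
  cell_type = rep(c("epithelial", "blood"), times = c(7000, 3000)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: reproducibly draw up to `n` barcodes carrying the
# label assumed to mark normal (non-malignant) cells.
pick_normal_cells <- function(annot, normal_label = "blood", n = 2000, seed = 42) {
  candidates <- annot$barcode[annot$cell_type == normal_label]
  set.seed(seed)  # same seed gives the same draw
  sample(candidates, size = min(n, length(candidates)))  # without replacement
}

normal_cells <- pick_normal_cells(annot)
length(normal_cells)
```

The resulting character vector is the kind of input the normal-cell parameter discussed in this thread expects; the exact argument name in an installed version can be checked with `args(SCEVAN::pipelineCNA)`.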

@ahdee

ahdee commented Jan 13, 2024

> I believe that using the filtered matrix is the correct procedure. To check for incorrectly classified cells, you can view the heatmap to see whether the separation was done correctly. Some errors can be caused by cells with a noisier signal. You can improve the final result by passing SCEVAN more cells that you are confident are normal, via the norm_cells parameter.
>
> Regards

@AntonioDeFalco Hi, it doesn't look like the function multiSampleComparisonClonalCN has an option to pass norm_cell?
I'm using version SCEVAN_1.0.1. Thanks!
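
One way to confirm whether an installed function accepts a given argument is to inspect its formal arguments. The sketch below uses a hypothetical helper, `has_arg`, demonstrated on base R's `hclust`; the SCEVAN check runs only if the package is installed, and the argument name `norm_cell` is taken from the comment above rather than verified.

```r
# Hypothetical helper: TRUE if `fun` declares a formal argument named `arg_name`.
has_arg <- function(fun, arg_name) {
  arg_name %in% names(formals(fun))
}

# Demonstration on a base R function
has_arg(stats::hclust, "method")
has_arg(stats::hclust, "norm_cell")

# The same check against SCEVAN, skipped when the package is not installed
if (requireNamespace("SCEVAN", quietly = TRUE)) {
  print(has_arg(SCEVAN::multiSampleComparisonClonalCN, "norm_cell"))
  print(has_arg(SCEVAN::pipelineCNA, "norm_cell"))
}
```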
