R scripts don't produce same RDS files #31

pavlin-policar · 2018-07-07T16:51:30Z

I've noticed that downloading the raw data and running the provided R scripts to generate the RDS files produces results that differ from the files on the website. Specifically, the logcounts don't match up everywhere.

I discovered this while looking through the scmap paper and going through the data sets. The problem only occurs for the data sets with CPM normalization (listed in Supplementary Table 2 of the scmap paper), namely: Goolam, Li, Kolodziejczyk, Baron, Segerstolpe, Klein, Zeisel, Shekhar and Macosko.

I've gone through how the logcounts are computed in create_sce.R, but the computation doesn't match the published values. For example, take the Li data set:

> log2(calculateCPM(sceset, use_size_factors = FALSE) + 1)[1:5, 1:2]
         RHA015__A549__turquoise RHA016__A549__turquoise
TSPAN6                  9.009166               9.2430517
TNMD                    0.000000               0.0000000
DPM1                    4.888420               6.9251494
SCYL3                   0.000000               6.7888400
C1orf112                0.000000               0.6565277

whereas loading li.rds, available at https://hemberg-lab.github.io/scRNA.seq.datasets/human/tissues/, gives:

> logcounts(li)[1:5, 1:2]
         RHA015__A549__turquoise RHA016__A549__turquoise
TSPAN6                 10.352333               10.586473
TNMD                    0.000000                0.000000
DPM1                    6.203446                8.262799
SCYL3                   0.000000                8.125773
C1orf112                0.000000                1.300885

How were these logcounts computed?
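
One way to narrow the question down is to undo the log2(x + 1) transform on both printouts and compare the values directly. The sketch below only uses the non-zero numbers copied from the two tables above; the vector names are made up for illustration, and it does not reproduce how either set of values was actually generated.

```r
# Non-zero entries copied from the two printouts above, in the same order:
# TSPAN6 and DPM1 (first cell), then TSPAN6, DPM1, SCYL3, C1orf112 (second cell).
recomputed <- c(9.009166, 4.888420, 9.2430517, 6.9251494, 6.7888400, 0.6565277)
published  <- c(10.352333, 6.203446, 10.586473, 8.262799, 8.125773, 1.300885)

# Undo log2(x + 1) to get back to the normalized scale.
recomputed_raw <- 2^recomputed - 1
published_raw  <- 2^published - 1

# If the two differ only by a multiplicative factor, this ratio is constant.
ratio <- published_raw / recomputed_raw
print(round(ratio, 3))
```

For these six entries the ratio comes out at about 2.54 in both cells, which would be consistent with the published values having been scaled by a different constant before the log transform. That is only an observation on a handful of values, not an explanation of where the factor comes from.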


wikiselev · 2018-07-11T09:45:50Z

Hi Pavlin, thanks for your question. All of the datasets were created from the scripts provided in the repository. In fact, this was done automatically; I wasn't involved. The only reason I can see for the difference is that a newer version of R, Bioconductor, scater or limma may have changed the calculateCPM function (calculateCPM is, I think, from the limma package). The datasets on our website were last generated on February 27, 2018, which was quite a long time ago, and many things have been updated since then. So my recommendation would be to use the scripts with the newest versions of all of the packages. You will get slightly different numbers, but I am sure the final results won't change dramatically. Hope this helps.

ForrestCKoch · 2019-10-09T00:08:24Z

I am experiencing the same issue. I don't think it's the calculateCPM function (which comes from scater). This is a pretty straightforward transformation (equivalent to t(t(counts)*1e6/colSums(counts))) and I can't find any record of it changing in the scater documentation.
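
To make the transformation above concrete, here is a minimal base-R sketch on a toy count matrix. The matrix and its gene and cell names are hypothetical, and the code deliberately does not call scater; it assumes that calculateCPM with use_size_factors = FALSE amounts to plain library-size scaling, as described above.

```r
# Hypothetical toy counts: 4 genes (rows) x 2 cells (columns).
counts <- matrix(c(10, 0, 5, 0,
                   20, 3, 0, 7),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("cellA", "cellB")))

# Counts per million: divide each column by its library size, times 1e6.
cpm <- t(t(counts) * 1e6 / colSums(counts))

# The same scaling written with sweep(), as a cross-check.
cpm_alt <- sweep(counts, 2, colSums(counts), "/") * 1e6
stopifnot(isTRUE(all.equal(cpm, cpm_alt)))

# log2(CPM + 1), the quantity compared in the first post.
logcpm <- log2(cpm + 1)
print(logcpm)
```

Each column of cpm sums to one million by construction, so any result that differs from this by a constant factor would have to come from a different denominator or an extra scaling step rather than from the formula itself.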

wikiselev wrote: "All of the datasets were created from the scripts provided in the repository. In fact, this was done automatically, I wasn't involved."

I'm unable to run your scripts without the changes in ForrestCKoch@09074c1. Are these the same scripts you used?
