Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

R scripts don't produce same RDS files #31

Open
pavlin-policar opened this issue Jul 7, 2018 · 2 comments
Open

R scripts don't produce same RDS files #31

pavlin-policar opened this issue Jul 7, 2018 · 2 comments

Comments

@pavlin-policar
Copy link

I've noticed that downloading the raw data and running the provided R scripts for the data sets to generate the RDS files outputs different results than what the ones on the website. Specifically, the logcounts don't match up everywhere.

I discovered this while looking through the scmap paper and going through the data sets. This problem only occurs for the data sets with CPM normalization (in Supplementary Table 2 from scmap) namely: Goolam, Li, Kolodziejczyk, Baron, Segerstolpe, Klein, Zeisel, Shekhar and Macosko.

I've gone through how the logcounts are computed in create_sce.R, but it doesn't match up with the actual results. For example, take the Li data set

> log2(calculateCPM(sceset, use_size_factors = FALSE) + 1)[1:5, 1:2]
         RHA015__A549__turquoise RHA016__A549__turquoise
TSPAN6                  9.009166               9.2430517
TNMD                    0.000000               0.0000000
DPM1                    4.888420               6.9251494
SCYL3                   0.000000               6.7888400
C1orf112                0.000000               0.6565277

while loading li.rds available at https://hemberg-lab.github.io/scRNA.seq.datasets/human/tissues/ gives

> logcounts(li)[1:5, 1:2]
         RHA015__A549__turquoise RHA016__A549__turquoise
TSPAN6                 10.352333               10.586473
TNMD                    0.000000                0.000000
DPM1                    6.203446                8.262799
SCYL3                   0.000000                8.125773
C1orf112                0.000000                1.300885

How were these logcounts computed?

@wikiselev
Copy link
Member

Hi Pavlin, thanks for your question. All of the datasets were created from the scripts provided in the repository. In fact, this was done automatically, I wasn't involved. The only reason for this being different I see that it maybe a new version of R/Bioconductor/scater/limma package, in which calculateCPM function has been updated. calculateCPM I think is from the limma package. Last time the datasets on our website were generated on February 27, 2018, which was quite a long time ago, many things have been updated since then. So, my recommendation would be to use the scripts with the newest versions of all of the packages. You will have slightly different numbers, but I am sure the final results won't change dramatically. Hope this helps.

@ForrestCKoch
Copy link

ForrestCKoch commented Oct 9, 2019

I am experiencing the same issue. I don't think it's the calculateCPM function (which comes from scater). This is a pretty straightforward transformation (equivalent to t(t(counts)*1e6/colSums(counts))) and I can't find any record of it changing in the scater documentation.

All of the datasets were created from the scripts provided in the repository. In fact, this was done automatically, I wasn't involved.

I'm unable to run your scripts without the changes in ForrestCKoch@09074c1. Are these the same scripts you used?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants