Adding Levels 3 and 4 Profile Data #34

Merged
gwaybio merged 6 commits into broadinstitute:master from add-profile-data on May 18, 2020

Conversation

@gwaybio
Member

gwaybio commented May 14, 2020

Processed with pycytominer. Also adding cell counts files. Very minor modifications made to scripts updating paths according to #32 (comment)

🎉

gwaybio requested a review from shntnu May 14, 2020 21:56
@gwaybio
Member Author

gwaybio commented May 14, 2020

cc @niranjchandrasekaran wooo!

@shntnu
Collaborator

shntnu commented May 15, 2020

Looks great! I didn't read the code, only looked at the structure of the profiles directory.

Minor edits:

  1. Remove row number in cell counts and save as CSV for consistency
  2. I am happy to change the name, but we've used _normalized_variable_selected.csv in the past (because of the name cytominer::variable_selection). I'm ok with either.

profiles/
├── 2016_04_01_a549_48hr_batch1
│   ├── SQ00014812
│   │   ├── SQ00014812.csv.gz
│   │   ├── SQ00014812_augmented.csv.gz
│   │   ├── SQ00014812_normalized.csv.gz
│   │   ├── SQ00014812_normalized_dmso.csv.gz
│   │   ├── SQ00014812_normalized_feature_select.csv.gz
│   │   └── SQ00014812_normalized_feature_select_dmso.csv.gz
...
├── cell_count
│   └── 2016_04_01_a549_48hr_batch1
│       ├── SQ00014812
│       │   └── SQ00014812_cell_count.tsv
….

I spent too much time thinking about whether we should encode the process description in the file name for

  • aggregation (mean or median)
  • normalization (robustize or standardize)
  • background distribution for normalization (dmso or plate)

I concluded that your current approach is nice. We don't want to clutter things too much here. It is poor practice to rely on the filename (beyond a point) to encode the process. It is definitely impractical to do in some cases (e.g. variable selection).

Also, I think it will be cleaner for pycytominer to count cells by summing the Count_Cells column in the Image.csv files for that well (see sample Image.csv file) and then report that column as Count_Cells in the <plate>_cell_count.csv file (cell count bingo!). But I don't think this is a big deal and we can leave it as is.

FYI I did GIT_LFS_SKIP_SMUDGE=1 git fetch to download only a pointer to the lfs files

@gwaybio
Member Author

gwaybio commented May 15, 2020

  1. Remove row number in cell counts and save as CSV for consistency

Code updated in bee9e79. I will rerun the pipeline (only the first step, counting cells, stopping there) to recreate the cell count CSV files and confirm that the updated paths work. This commit will come in at some point later today.
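
For readers following along, here is a minimal sketch of what "remove the row number and save as CSV" could look like with pandas. It is not the code from bee9e79; the table contents, the column names, and the helper `write_cell_counts` are assumptions made up for illustration.

```python
import io

import pandas as pd

# Hypothetical cell-count table; the column names are assumptions for illustration.
cell_counts = pd.DataFrame(
    {
        "Metadata_Plate": ["SQ00014812", "SQ00014812"],
        "Metadata_Well": ["A01", "A02"],
        "cell_count": [2100, 1985],
    }
)


def write_cell_counts(df, path_or_buffer):
    """Write cell counts as comma-separated text without the row-number index."""
    # index=False drops the leading row-number column that pandas writes by default.
    df.to_csv(path_or_buffer, sep=",", index=False)


# Write to an in-memory buffer so the sketch runs without touching the file system.
buffer = io.StringIO()
write_cell_counts(cell_counts, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file path such as a `<plate>_cell_count.csv` name instead of the buffer would write the same content to disk.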

  2. I am happy to change the name, but we've used _normalized_variable_selected.csv in the past (because of the name cytominer::variable_selection). I'm ok with either.
    We don't want to clutter things too much here. It is poor practice to rely on the filename (beyond a point) to encode the process. It is definitely impractical to do in some cases (e.g. variable selection).

I agree 💯 the current approach is very readable and I am ok to keep it as is since it would take a lot of time for minimal benefit.

Also, I think it will be cleaner for pycytominer to count cells by summing the Count_Cells column in the Image.csv files for that well (see sample Image.csv file) and then report that column as Count_Cells in the <plate>_cell_count.csv file (cell count bingo!). But I don't think this is a big deal and we can leave it as is.

This is great, thank you for the note! I agree that this proposal is a more elegant solution. Let's reserve it for a future enhancement - I describe it in cytomining/pycytominer/issues/80
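
To make the proposal concrete, here is a rough sketch of summing Count_Cells per well with pandas. It is not pycytominer code. `Count_Cells` comes from the comment above; the `Metadata_Plate` and `Metadata_Well` column names, the toy values, and the helper `count_cells_per_well` are assumptions.

```python
import pandas as pd

# Toy stand-in for an Image.csv table (one row per image).
# "Count_Cells" is named in the discussion; the Metadata_* columns are assumed names.
image_df = pd.DataFrame(
    {
        "Metadata_Plate": ["SQ00014812"] * 4,
        "Metadata_Well": ["A01", "A01", "A02", "A02"],
        "Count_Cells": [120, 95, 80, 101],
    }
)


def count_cells_per_well(df, well_cols=("Metadata_Plate", "Metadata_Well")):
    """Sum the per-image Count_Cells values within each well."""
    # as_index=False keeps the grouping columns as regular columns in the result.
    return df.groupby(list(well_cols), as_index=False)["Count_Cells"].sum()


well_counts = count_cells_per_well(image_df)
print(well_counts)
```

The result keeps the column name `Count_Cells`, so it could be written out directly as the per-plate cell count table.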

FYI I did GIT_LFS_SKIP_SMUDGE=1 git fetch to download only a pointer to the lfs files

👍 Good to know this is a specific option. In the past, I have noticed that some users of datasets stored as git LFS files struggle to access them (I think pointers are downloaded by default if git LFS isn't installed). We should be explicit about download instructions. I've added a note to do this in #35
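
As a possible aid for the download instructions, here is a small sketch that checks whether a file on disk is still a git LFS pointer rather than the real data. It relies on the fact that LFS pointer files are short text files whose first line starts with `version https://git-lfs.github.com/spec/v1`. The helper name `is_lfs_pointer` and the demo file contents (including the all-zero oid) are made up for illustration.

```python
import tempfile
from pathlib import Path

# Git LFS pointer files begin with this version line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file starts with the git LFS pointer version line."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


with tempfile.TemporaryDirectory() as tmp_dir:
    # Fake pointer file; the oid and size are placeholders, not real values.
    pointer_path = Path(tmp_dir) / "pointer.csv.gz"
    pointer_path.write_bytes(
        LFS_POINTER_PREFIX + b"\noid sha256:" + b"0" * 64 + b"\nsize 12345\n"
    )

    # Fake data file starting with the gzip magic bytes.
    data_path = Path(tmp_dir) / "data.csv.gz"
    data_path.write_bytes(b"\x1f\x8b\x08\x00 not a pointer")

    pointer_result = is_lfs_pointer(pointer_path)
    data_result = is_lfs_pointer(data_path)

print(pointer_result, data_result)
```

A check like this could run before loading a profile, so that a pointer file produces a clear message instead of a confusing gzip error.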

@gwaybio
Member Author

gwaybio commented May 15, 2020

This commit will come in at some point later today.

I lied - will not come until next week. It is going to take some time to process the count files again with the changes specified in bee9e79 (I do agree that these changes are necessary fwiw)

@shntnu - how do you feel about merging the PR with just profiles?

The next steps after merging, as I see them are:

  1. Add a notebook to generate consensus perturbation signatures (MODZ and median)
  2. Update cell health submodule (Adding LINCS repo as a submodule cell-health#125) - I am eager to finalize figure 4!
  3. Add cell count files after rerunning processing (only cell count step, exiting after)
  4. Update whitening procedure in pycytominer and add whitened profiles

@shntnu
Collaborator

shntnu commented May 15, 2020

@shntnu - how do you feel about merging the PR with just profiles?

Sounds good to me.

The next steps after merging, as I see them are:

  1. Add a notebook to generate consensus perturbation signatures (MODZ and median)

👍

  2. Update cell health submodule (broadinstitute/cell-health#125) - I am eager to finalize figure 4!

👍 (I did not look at the PR)

  3. Add cell count files after rerunning processing (only cell count step, exiting after)

👍

  4. Update whitening procedure in pycytominer and add whitened profiles

👍

Could you please double check that all this, especially 1 and 4, lines up with the plan here #4 (comment)?

@gwaybio
Member Author

gwaybio commented May 15, 2020

Could you please double check that all this, especially 1 and 4, lines up with the plan here #4 (comment)?

Confirmed, and reorganized thoughts into a GitHub project:
https://github.com/broadinstitute/lincs-cell-painting/projects/1

@gwaybio
Member Author

gwaybio commented May 18, 2020

If everything looks good, @shntnu, can you approve the PR? I will work on consensus signatures next.

gwaybio merged commit e0d5e81 into broadinstitute:master May 18, 2020
gwaybio deleted the add-profile-data branch May 18, 2020 20:23