Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature selected files available, with more row annotations included #48

Merged
merged 11 commits into from
Mar 21, 2021

Conversation

shntnu
Copy link
Collaborator

@shntnu shntnu commented May 30, 2020

@shntnu shntnu linked an issue May 30, 2020 that may be closed by this pull request
@shntnu

This comment has been minimized.

@shntnu
Copy link
Collaborator Author

shntnu commented May 30, 2020

Simpler test:

No diffs when uncompressed

$ diff <(gzcat 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz) <(gzcat 2016_04_01_a549_48hr_batch1_shsingh/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz)

But binaries differ

$ diff 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz 2016_04_01_a549_48hr_batch1_shsingh/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz
Binary files 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz and 2016_04_01_a549_48hr_batch1_shsingh/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz differ

and that's because of the date

$ file 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz 2016_04_01_a549_48hr_batch1_shsingh/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz
2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz:         gzip compressed data, was "2016_04_01_a549_48hr_batch1_consensus_median.csv", 
last modified: Fri May 22 12:56:15 2020, max compression, original size 157348070
2016_04_01_a549_48hr_batch1_shsingh/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz: gzip compressed data, was "2016_04_01_a549_48hr_batch1_consensus_median.csv", 
last modified: Sat May 30 00:14:25 2020, max compression, original size 157348070

There must be some git attribute setting to address this

@shntnu shntnu changed the title Add GCT file, rerun notebook Consensus_modz file available in GCT format May 30, 2020
@shntnu
Copy link
Collaborator Author

shntnu commented May 30, 2020

One solution is to specify the diff rule for gz files in your ~/.gitconfig (below) and use that in the .gitattributes (as I have done in this PR):

[diff "gzip"]
  binary = true
  textconv = /usr/bin/gzcat

@shntnu
Copy link
Collaborator Author

shntnu commented May 30, 2020

@gwaygenomics I just realized that you must have encountered this issue, right? Do you explicitly check for diffs rather than relying on git e.g. diff <(gunzip -c a.csv.gz) <(gunzip -c b.csv.gz)

@shntnu
Copy link
Collaborator Author

shntnu commented May 30, 2020

(there's an additional, separate issue related to this cytomining/pycytominer#82 which affects the GCT file)

@gwaybio
Copy link
Member

gwaybio commented May 30, 2020

I just realized that you must have encountered this issue, right?

Woah, never knew this was a thing. Thanks for digging into it! So, if I'm understanding correctly, the md5sum for the file we provided CLUE will fail to validate?

@shntnu
Copy link
Collaborator Author

shntnu commented May 30, 2020

Woah, never knew this was a thing. Thanks for digging into it! So, if I'm understanding correctly, the md5sum for the file we provided CLUE will fail to validate?

Nope, that aspect of it is fine.

It is actually a somewhat minor but annoying issue.

If you rerun any notebook that produces a file and then gzips it, git will detect the .gz as a modified file because the hashes are different (because the dates of the compressed data are different, even if the contents the file are identical). See below.

We now have a fix! (old notes are collapsed below). We need to use gzip -n (don't save original file name or time stamp) and we are all set!

mkdir gittest
cd gittest/
git init .
# Initialized empty Git repository in /private/tmp/gittest/.git/


echo a b c > x.txt
file x.txt
# x.txt: ASCII text
gzip -n x.txt
file x.txt.gz
# x.txt.gz: gzip compressed data, from Unix, original size 6
md5sum x.txt.gz
# ca04a6662ec96a20339f793db203b9c6  x.txt.gz


git add x.txt.gz
git commit -m "add file"
# [master (root-commit) 540f687] add file
#  1 file changed, 0 insertions(+), 0 deletions(-)
#  create mode 100644 x.txt.gz

echo a b c > x.txt
file x.txt
# x.txt: ASCII text
gzip -n x.txt
# x.txt.gz already exists -- do you wish to overwrite (y or n)? y
file x.txt.gz
# x.txt.gz: gzip compressed data, from Unix, original size 6
md5sum x.txt.gz
# ca04a6662ec96a20339f793db203b9c6  x.txt.gz

git status
# On branch master
# nothing to commit, working tree clean
Click to expand old notes
$ mkdir gittest
$ cd gittest/
$ git init .
Initialized empty Git repository in /private/tmp/gittest/.git/


$ echo a b c > x.txt
$ file x.txt
x.txt: ASCII text
$ gzip x.txt
$ file x.txt.gz
x.txt.gz: gzip compressed data, was "x.txt", last modified: Sat May 30 14:07:46 2020, from Unix, original size 6
$ md5sum x.txt.gz
3ab6b3c300d2e33106f6aa13afe23a60  x.txt.gz


$ git add x.txt.gz
$ git commit -m "add file"
[master (root-commit) 540f687] add file
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 x.txt.gz


$ echo a b c > x.txt
$ file x.txt
x.txt: ASCII text
$ gzip x.txt
x.txt.gz already exists -- do you wish to overwrite (y or n)? y
$ file x.txt.gz
x.txt.gz: gzip compressed data, was "x.txt", last modified: Sat May 30 14:08:39 2020, from Unix, original size 6
$ md5sum x.txt.gz
b7fed9241609e14b7278bf486dece3fd  x.txt.gz


$ git diff
$ git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
    modified:   x.txt.gz

no changes added to commit (use "git add" and/or "git commit -a")
$ git diff x.txt.gz

diff --git a/x.txt.gz b/x.txt.gz
index 8a132e5..6c6122d 100644
Binary files a/x.txt.gz and b/x.txt.gz differ


$ cp x.txt.gz x.txt.gz.new
$ git stash
Saved working directory and index state WIP on master: 540f687 add file
$ diff x.txt.gz x.txt.gz.new
Binary files x.txt.gz and x.txt.gz.new differ
$ diff <(gunzip -c x.txt.gz) <(gunzip -c x.txt.gz.new)
(no difference)

However, if we do these two things:

  1. Append this to your ~/.gitconfig
[diff "gzip"]
  binary = true
  textconv = /usr/bin/gunzip -c
  1. Modify the .gitattributes of this repo, by changing this line
*.gz filter=lfs diff=lfs merge=lfs -text

to this

*.gz filter=lfs diff=gzip merge=lfs -text

then although a git status will show that the gz files have been modified, no diffs will be reported because it will use the textconv string to diff the contents.

But this still doesn't solve the issue that a user may inadvertently do a git add . resulting in all the recreated (but not actually modified) gz files being added to the repo. More pondering.

@shntnu
Copy link
Collaborator Author

shntnu commented May 30, 2020

@gwaygenomics Please see the updates to the previous comment. We need to modify these lines appropriately so that gzip does not save the original file name nor time stamp (mimic as gzip -n) and then we are all set.

  1. https://github.com/cytomining/pycytominer/blob/dd064c2185435e19541bafa7c976d55da15cf09e/pycytominer/cyto_utils/output.py#L56-L62

  2. consensus_df.to_csv(
    consensus_file, sep=",", compression="gzip", float_format="%5g", index=False
    )

Unfortunately pandas does not currently support specifying options if the compression method is gzip, so you'd first need to save as CSV and then compress

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html

Revert git diff because we no longer need that
@shntnu
Copy link
Collaborator Author

shntnu commented May 31, 2020

What a coincidence!
pandas-dev/pandas#28103

For now, I suggest we take the non-pretty approach from Dan's snippet to address this issue:
pandas-dev/pandas#28103 (comment)

@gwaybio
Copy link
Member

gwaybio commented May 31, 2020

For now, I suggest we take the non-pretty approach from Dan's snippet to address this issue:

Awesome - great we could track this down! Will this change be made in this PR?

@shntnu
Copy link
Collaborator Author

shntnu commented May 31, 2020

Awesome - great we could track this down! Will this change be made in this PR?

Can this line

consensus_df.to_csv(
consensus_file, sep=",", compression="gzip", float_format="%5g", index=False
)

can be replaced with this?

cyto_utils.output(
    df=consensus_df,
    output_filename=consensus_file,
    float_format="%5g",
    compression=gzip,
)

I think yes, and if so, then the changes should be made in cyto_utils.output
https://github.com/cytomining/pycytominer/blob/dd064c2185435e19541bafa7c976d55da15cf09e/pycytominer/cyto_utils/output.py#L56-L62

In that case, this PR will do 3 things, after the cyto_utils.output code is updated:

  1. Update environment.yml to point to the version of pycytominer that has the new cyto_utils.output
  2. A bash script that gunzips all the .gz files in this repo and then runs gzip -n on each
  3. Update build-consensus-signatures.{py,ipynb} to use cyto_utils.output
  4. Re-run build-consensus-signatures.ipynb for neatness

Note that I recommend doing (2) instead of running all the scripts again (except the consensus one because that's easy) – that's too much work!

Bash script

# create gz,csv pairs
for gzip_file in `find . -name "*.csv.gz" `; 
do 
  echo $gzip_file,${gzip_file%.*}  
done > /tmp/rename.csv

# re-zip the file
parallel -a /tmp/rename.csv -C "," -L 2 \
  "echo gunzip {1} && echo gzip -n {2}"

@gwaybio
Copy link
Member

gwaybio commented May 31, 2020

Yep, we can make that update - indeed i do think it will improve pycytominer.cyto_utils.output.

In general, though, we want to avoid processing version 1 lincs-cell-painting with multiple pycytominer versions. i.e. the whole image-based-profiling pipeline was processed with pycytominer@dd064c2185435e19541bafa7c976d55da15cf09e, a small change introduced here for a relatively minor improvement would bump the version. So, unless we then reprocess the whole pipeline with the updated hash, it would be tough to track down which pieces were processed with which pycytominer version.

I recommend adding this fix to a version 2 wishlist, and writing the consensus output with the proposed gzip solution in the consensus notebook.

Is there something I am overlooking here? What do you think?

@shntnu
Copy link
Collaborator Author

shntnu commented May 31, 2020

I updated my previous comment (looks like we were responding simultaneously :D).

My only additional note is that this PR can be part of Version 2; no need to produce the GCT file for Version 1 in any case. Admittedly the GCT and the CSV issue are distinct components, and this PR might be a little too bulky for our liking, but it's not terrible.

@shntnu
Copy link
Collaborator Author

shntnu commented May 31, 2020

@gwaygenomics I don't think I've fully absorbed #48 (comment). But given that this PR will go in version 2, is it ok I delay looking into this further? Or is that blocking you on your immediate goals?

@shntnu
Copy link
Collaborator Author

shntnu commented Mar 9, 2021

From #48 (comment) @gwaygenomics said

I recommend ... writing the consensus output with the proposed gzip solution in the consensus notebook.

Although pycytominer now has the fix needed cytomining/pycytominer#83 (comment), I'll do what's suggested above to avoid bumping up the version of pycytominer.

@shntnu
Copy link
Collaborator Author

shntnu commented Mar 9, 2021

This PR will do 2 things:

  1. Re-run build-consensus-signatures.ipynb for neatness
  2. Add a bash script that gunzips all the .gz files in this repo and then runs gzip -n on each
  3. Run the bash script

Bash script

# create gz,csv pairs
for gzip_file in `find . -name "*.csv.gz" `; 
do 
  echo $gzip_file,${gzip_file%.*}  
done > /tmp/rename.csv

# re-zip the file
parallel -a /tmp/rename.csv -C "," -L 2 \
  "echo gunzip {1} && echo gzip -n {2}"

@gwaybio
Copy link
Member

gwaybio commented Mar 9, 2021

a quick heads up @shntnu - I am running through the spherize operations and I noticed a couple things:

  1. In the current pycytominer version in this repo, we still use the terms "blacklist" and "whiten" (we addressed this in Change whiten nomenclature to spherize cytomining/pycytominer#102)
  2. There is no current way to alter epsilon in the spherize implementation of pycytominer.normalize.

What should we do? If we update pycytominer version, we'll likely need to reprocess everything. It'll also enable us to output gzip profiles without timestamps

@shntnu
Copy link
Collaborator Author

shntnu commented Mar 9, 2021

  • Add a bash script that gunzips all the .gz files in this repo and then runs gzip -n on each

This is not a good fix either, because it seems to introduce other, undesirable metadata (see metadata of new file)

Original file, i.e. before running the script:

file 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz

2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz: 
gzip compressed data, was "2016_04_01_a549_48hr_batch1_consensus_median.csv", 
last modified: Tue Mar  9 14:19:07 2021, 
max compression, 
original size modulo 2^32 157348070

New file, i.e.after running the script:

file 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz

2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz:
gzip compressed data, 
from Unix, 
original size modulo 2^32 157348070 gzip compressed data, 
unknown method, 
has CRC, 
extra field, 
has comment, 
encrypted, 
from FAT filesystem (MS-DOS, OS/2, NT), 
original size modulo 2^32 157348070

@gwaygenomics Can you run file on .csv.gz file that was created using the updated output.py cytomining/pycytominer#119 and paste it here.

If that does not have this extra metadata, it wouldn't make sense for us to include this manual gzip -nfix. Instead, I will push the changes as is (without the fix) and then actually re-run it later.

@shntnu
Copy link
Collaborator Author

shntnu commented Mar 9, 2021

Actually, coming to think of it, there's no upside to including the fix, because our next version will anyway have the updated pycytominer. I'll ponder and ping again.

@shntnu
Copy link
Collaborator Author

shntnu commented Mar 9, 2021

There's literally no upside.

I verified that this command did produce any outputs.

# copy newly create files to a new folder
cp -r 2016_04_01_a549_48hr_batch1 2016_04_01_a549_48hr_batch1_shsingh

# restore old files
git stash

# do a diff on gz contents
(ls 2016_04_01_a549_48hr_batch1/*.csv.gz) | parallel basename {} | parallel   "diff <(gzcat 2016_04_01_a549_48hr_batch1/{}) <(gzcat 2016_04_01_a549_48hr_batch1_shsingh/{})"

We are all set.

In a future version, we will re-run using the update output.py. Until then, re-running the notebook (ipython scripts/nbconverted/build-consensus-signatures.py) will produce .csv.gz files that differ from the repo, because of the timestamp issue. The profiles module of this repo will have the same issue.

@shntnu shntnu marked this pull request as ready for review March 9, 2021 20:25
@shntnu shntnu requested a review from gwaybio March 9, 2021 20:25
Copy link
Member

@gwaybio gwaybio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - my only question is: do we want to perform feature selection before creating the .gct?

@shntnu
Copy link
Collaborator Author

shntnu commented Mar 9, 2021

LGTM - my only question is: do we want to perform feature selection before creating the .gct?

Ah yes, we should! Will do

@gwaybio
Copy link
Member

gwaybio commented Mar 9, 2021

oh, @shntnu - I just realized you can also address #49 in this PR! It should be a simple fix to add MOA and target variables to the replicate_cols variable

@shntnu shntnu changed the title Consensus_modz file available in GCT format Feature selected files available; consensus_modz file available in GCT format Mar 10, 2021
@shntnu shntnu requested a review from gwaybio March 10, 2021 03:20
@shntnu
Copy link
Collaborator Author

shntnu commented Mar 10, 2021

a quick heads up @shntnu - I am running through the spherize operations and I noticed a couple things:

  1. In the current pycytominer version in this repo, we still use the terms "blacklist" and "whiten" (we addressed this in cytomining/pycytominer#102)
  2. There is no current way to alter epsilon in the spherize implementation of pycytominer.normalize.

What should we do? If we update pycytominer version, we'll likely need to reprocess everything. It'll also enable us to output gzip profiles without timestamps

  1. I'm not too worried about it because we've done all the right things to fix it in the library, and it will make it into this repo in the next version anyway.

  2. Add epsilon as an option in normalize cytomining/pycytominer#128. If the default value of epsilion=1e-6 is fine, then we needn't fix that issue right now. How would we know whether it is fine or not? I suppose we can just do it very crudely and empirically for now: do the results improve similar to what we've seen in past analysis by Ted et al.?

  • If they do, then epsilion=1e-6 is fine and there's nothing to be done here.
  • If they don't improve then we need to think harder about the plan
    Update: I just noticed Adding spherized profiles #60 so you're all set to figure out whether there's anything to be done here.
  1. gzip to me is the most annoying thing :D But we can live with it using this solution Feature selected files available, with more row annotations included #48 (comment)

So in all, we don't need to bump the pycytominer version if 2. works out fine

@shntnu
Copy link
Collaborator Author

shntnu commented Mar 18, 2021

@gwaygenomics bumping to make sure you saw that it's back on your desk to review (no hurry from my end)

Edit: I just realized that you're working on the spherizing thingie, so we need to sort that out first. Ping me if it needs anything from me.

Copy link
Member

@gwaybio gwaybio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One suggestion, one small fix, and one question.

consensus/README.md Outdated Show resolved Hide resolved
@gwaybio gwaybio mentioned this pull request Mar 19, 2021
3 tasks
@shntnu shntnu changed the title Feature selected files available; consensus_modz file available in GCT format Feature selected files available, with more row annotations included Mar 20, 2021
Copy link
Member

@gwaybio gwaybio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All set on my end, but I'll let you merge in case you want to make any additional adjustments.

.gitattributes Outdated Show resolved Hide resolved
Co-authored-by: Greg Way <[email protected]>
@shntnu shntnu merged commit d471bbd into broadinstitute:master Mar 21, 2021
@shntnu shntnu deleted the gct branch March 21, 2021 10:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make consensus_modz file available in GCT format
2 participants