Feature selected files available, with more row annotations included #48

shntnu · 2020-05-30T00:47:01Z

Addresses "Consensus profiles are missing MOA and target information" (#49)
Feature selected files are available
Does not attempt to fix the gzip -n problem (see #48 (comment))
Does not address "Make consensus_modz file available in GCT format" (#47), because I decided not to fix the issue described in #48 (comment)

shntnu · 2020-05-30T01:42:33Z

Simpler test:

No diffs when uncompressed

$ diff <(gzcat 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz) <(gzcat 2016_04_01_a549_48hr_batch1_shsingh/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz)

But binaries differ

$ diff 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz 2016_04_01_a549_48hr_batch1_shsingh/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz
Binary files 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz and 2016_04_01_a549_48hr_batch1_shsingh/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz differ

and that's because of the date

$ file 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz 2016_04_01_a549_48hr_batch1_shsingh/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz
2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz:         gzip compressed data, was "2016_04_01_a549_48hr_batch1_consensus_median.csv", last modified: Fri May 22 12:56:15 2020, max compression, original size 157348070
2016_04_01_a549_48hr_batch1_shsingh/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz: gzip compressed data, was "2016_04_01_a549_48hr_batch1_consensus_median.csv", last modified: Sat May 30 00:14:25 2020, max compression, original size 157348070

There must be some git attribute setting to address this

shntnu · 2020-05-30T04:03:57Z

Relevant links

shntnu · 2020-05-30T04:19:07Z

One solution is to specify the diff rule for gz files in your ~/.gitconfig (below) and use that in the .gitattributes (as I have done in this PR):

[diff "gzip"]
  binary = true
  textconv = /usr/bin/gzcat

shntnu · 2020-05-30T04:24:46Z

@gwaygenomics I just realized that you must have encountered this issue, right? Do you explicitly check for diffs rather than relying on git, e.g. diff <(gunzip -c a.csv.gz) <(gunzip -c b.csv.gz)?
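
The same explicit check can be scripted when shell process substitution is not available. This is a minimal sketch, not something from this repo: the helper name `same_gzip_contents` is hypothetical. It compares only the decompressed bytes, so header differences such as the stored timestamp are ignored.

```python
import gzip


def same_gzip_contents(path_a, path_b, chunk_size=1 << 20):
    """Return True if two gzip files decompress to identical bytes."""
    with gzip.open(path_a, "rb") as file_a, gzip.open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both streams ended at the same point
                return True
```

Reading in chunks keeps memory use flat, which matters for files like the ~157 MB consensus CSV mentioned above.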

shntnu · 2020-05-30T04:40:59Z

(there's an additional, separate issue related to this cytomining/pycytominer#82 which affects the GCT file)

gwaybio · 2020-05-30T12:23:37Z

I just realized that you must have encountered this issue, right?

Woah, never knew this was a thing. Thanks for digging into it! So, if I'm understanding correctly, the md5sum for the file we provided CLUE will fail to validate?

shntnu · 2020-05-30T14:17:15Z

Woah, never knew this was a thing. Thanks for digging into it! So, if I'm understanding correctly, the md5sum for the file we provided CLUE will fail to validate?

Nope, that aspect of it is fine.

It is actually a somewhat minor but annoying issue.

If you rerun any notebook that produces a file and then gzips it, git will detect the .gz as a modified file because the hashes are different (the timestamps stored in the compressed files differ, even if the contents of the files are identical). See below.

We now have a fix! (old notes are collapsed below). We need to use gzip -n (don't save original file name or time stamp) and we are all set!

mkdir gittest
cd gittest/
git init .
# Initialized empty Git repository in /private/tmp/gittest/.git/


echo a b c > x.txt
file x.txt
# x.txt: ASCII text
gzip -n x.txt
file x.txt.gz
# x.txt.gz: gzip compressed data, from Unix, original size 6
md5sum x.txt.gz
# ca04a6662ec96a20339f793db203b9c6  x.txt.gz


git add x.txt.gz
git commit -m "add file"
# [master (root-commit) 540f687] add file
#  1 file changed, 0 insertions(+), 0 deletions(-)
#  create mode 100644 x.txt.gz

echo a b c > x.txt
file x.txt
# x.txt: ASCII text
gzip -n x.txt
# x.txt.gz already exists -- do you wish to overwrite (y or n)? y
file x.txt.gz
# x.txt.gz: gzip compressed data, from Unix, original size 6
md5sum x.txt.gz
# ca04a6662ec96a20339f793db203b9c6  x.txt.gz

git status
# On branch master
# nothing to commit, working tree clean
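
The effect of the stored timestamp can also be reproduced in Python, which is where the files in this repo are written. This is a sketch under assumptions: `gzip_bytes` is a hypothetical helper, and the bytes it produces are not claimed to match the `gzip -n` command-line output, only to behave the same way with respect to the header timestamp.

```python
import gzip
import hashlib
import io


def gzip_bytes(data, mtime=None):
    """Compress data in memory; mtime=None lets gzip use the current time."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb", mtime=mtime) as handle:
        handle.write(data)
    return buffer.getvalue()


payload = b"a b c\n"

# Same content, different header timestamps: the bytes and hashes differ
stamped_1 = gzip_bytes(payload, mtime=1590847666)
stamped_2 = gzip_bytes(payload, mtime=1590847719)
print(hashlib.md5(stamped_1).hexdigest() == hashlib.md5(stamped_2).hexdigest())  # False

# A fixed timestamp of 0 makes repeated runs byte-identical
fixed_1 = gzip_bytes(payload, mtime=0)
fixed_2 = gzip_bytes(payload, mtime=0)
print(fixed_1 == fixed_2)  # True
```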

Old notes (collapsed in the original comment):

$ mkdir gittest
$ cd gittest/
$ git init .
Initialized empty Git repository in /private/tmp/gittest/.git/


$ echo a b c > x.txt
$ file x.txt
x.txt: ASCII text
$ gzip x.txt
$ file x.txt.gz
x.txt.gz: gzip compressed data, was "x.txt", last modified: Sat May 30 14:07:46 2020, from Unix, original size 6
$ md5sum x.txt.gz
3ab6b3c300d2e33106f6aa13afe23a60  x.txt.gz


$ git add x.txt.gz
$ git commit -m "add file"
[master (root-commit) 540f687] add file
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 x.txt.gz


$ echo a b c > x.txt
$ file x.txt
x.txt: ASCII text
$ gzip x.txt
x.txt.gz already exists -- do you wish to overwrite (y or n)? y
$ file x.txt.gz
x.txt.gz: gzip compressed data, was "x.txt", last modified: Sat May 30 14:08:39 2020, from Unix, original size 6
$ md5sum x.txt.gz
b7fed9241609e14b7278bf486dece3fd  x.txt.gz


$ git diff
$ git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
    modified:   x.txt.gz

no changes added to commit (use "git add" and/or "git commit -a")
$ git diff x.txt.gz

diff --git a/x.txt.gz b/x.txt.gz
index 8a132e5..6c6122d 100644
Binary files a/x.txt.gz and b/x.txt.gz differ


$ cp x.txt.gz x.txt.gz.new
$ git stash
Saved working directory and index state WIP on master: 540f687 add file
$ diff x.txt.gz x.txt.gz.new
Binary files x.txt.gz and x.txt.gz.new differ
$ diff <(gunzip -c x.txt.gz) <(gunzip -c x.txt.gz.new)
(no difference)

However, if we do these two things:

Append this to your ~/.gitconfig

[diff "gzip"]
  binary = true
  textconv = /usr/bin/gunzip -c

Modify the .gitattributes of this repo, by changing this line

*.gz filter=lfs diff=lfs merge=lfs -text

to this

*.gz filter=lfs diff=gzip merge=lfs -text

then although a git status will show that the gz files have been modified, no diffs will be reported because it will use the textconv string to diff the contents.

But this still doesn't solve the issue that a user may inadvertently do a git add . resulting in all the recreated (but not actually modified) gz files being added to the repo. More pondering.

shntnu · 2020-05-30T17:57:12Z

@gwaygenomics Please see the updates to the previous comment. We need to modify these lines appropriately so that gzip saves neither the original file name nor the time stamp (mimicking gzip -n), and then we are all set.

https://github.com/cytomining/pycytominer/blob/dd064c2185435e19541bafa7c976d55da15cf09e/pycytominer/cyto_utils/output.py#L56-L62

lincs-cell-painting/consensus/scripts/nbconverted/build-consensus-signatures.py

Lines 206 to 208 in e6852b4

    
consensus_df.to_csv(
    consensus_file, sep=",", compression="gzip", float_format="%5g", index=False
)

Unfortunately pandas does not currently support specifying options if the compression method is gzip, so you'd first need to save as CSV and then compress

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
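
One way to do the "save as CSV, then compress" step without a stored name or timestamp is to go through Python's `gzip.GzipFile` directly. The sketch below is an illustration, not the code used in this repo or in pycytominer; the function name `write_csv_gz_without_timestamp` is hypothetical, and it takes the CSV as text so that it runs with the standard library alone.

```python
import gzip


def write_csv_gz_without_timestamp(csv_text, output_path):
    """Write CSV text to a gzip file whose header has no name or timestamp."""
    with open(output_path, "wb") as raw:
        # filename="" omits the original-name field; mtime=0 zeroes the timestamp
        with gzip.GzipFile(filename="", fileobj=raw, mode="wb", mtime=0) as handle:
            handle.write(csv_text.encode("utf-8"))
```

With pandas, `consensus_df.to_csv(index=False, float_format="%5g")` returns the CSV as a string when no path is given, so that string could be passed as `csv_text`.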

Revert git diff because we no longer need that

shntnu · 2020-05-31T12:07:01Z

What a coincidence!
pandas-dev/pandas#28103

For now, I suggest we take the non-pretty approach from Dan's snippet to address this issue:
pandas-dev/pandas#28103 (comment)

gwaybio · 2020-05-31T13:03:34Z

For now, I suggest we take the non-pretty approach from Dan's snippet to address this issue:

Awesome - great we could track this down! Will this change be made in this PR?

shntnu · 2020-05-31T13:36:20Z

Awesome - great we could track this down! Will this change be made in this PR?

Can these lines

lincs-cell-painting/consensus/scripts/nbconverted/build-consensus-signatures.py

Lines 206 to 208 in e6852b4

consensus_df.to_csv(
    consensus_file, sep=",", compression="gzip", float_format="%5g", index=False
)

be replaced with this?

cyto_utils.output(
    df=consensus_df,
    output_filename=consensus_file,
    float_format="%5g",
    compression="gzip",
)

I think yes, and if so, then the changes should be made in cyto_utils.output
https://github.com/cytomining/pycytominer/blob/dd064c2185435e19541bafa7c976d55da15cf09e/pycytominer/cyto_utils/output.py#L56-L62

In that case, this PR will do 4 things, after the cyto_utils.output code is updated:

1. Update environment.yml to point to the version of pycytominer that has the new cyto_utils.output
2. Add a bash script that gunzips all the .gz files in this repo and then runs gzip -n on each
3. Update build-consensus-signatures.{py,ipynb} to use cyto_utils.output
4. Re-run build-consensus-signatures.ipynb for neatness

Note that I recommend doing (2) instead of running all the scripts again (except the consensus one because that's easy) – that's too much work!

Bash script

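# Create gz,csv pairs: each .csv.gz path and the .csv path it decompresses to
find . -name "*.csv.gz" | while IFS= read -r gzip_file
do
  echo "${gzip_file},${gzip_file%.*}"
done > /tmp/rename.csv

# Re-zip each file without the stored name and timestamp.
# As written, this only prints the commands; remove both "echo"s to run them.
parallel -a /tmp/rename.csv -C "," -L 2 \
  "echo gunzip {1} && echo gzip -n {2}"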
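
For comparison, the same re-compression pass can be sketched in Python. This is only an illustration under assumptions: `regzip_without_timestamp` is a hypothetical helper, and its output is not guaranteed to be byte-identical to what `gzip -n` writes (compression settings and other header bytes may differ); it only drops the stored name and zeroes the timestamp.

```python
import gzip
import os
import shutil
from pathlib import Path


def regzip_without_timestamp(root):
    """Rewrite every .csv.gz under root with no stored name or timestamp."""
    rewritten = []
    for gz_path in sorted(Path(root).rglob("*.csv.gz")):
        tmp_path = gz_path.with_name(gz_path.name + ".tmp")
        with gzip.open(gz_path, "rb") as source, open(tmp_path, "wb") as raw:
            with gzip.GzipFile(filename="", fileobj=raw, mode="wb", mtime=0) as target:
                shutil.copyfileobj(source, target)  # stream, do not load whole file
        os.replace(tmp_path, gz_path)  # swap in the new file only after success
        rewritten.append(gz_path)
    return rewritten
```

Running it a second time should leave the files unchanged, which is the property that keeps `git status` clean.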

gwaybio · 2020-05-31T13:44:35Z

Yep, we can make that update - indeed I do think it will improve pycytominer.cyto_utils.output.

In general, though, we want to avoid processing version 1 of lincs-cell-painting with multiple pycytominer versions. The whole image-based-profiling pipeline was processed with pycytominer@dd064c2185435e19541bafa7c976d55da15cf09e, so a small change introduced here for a relatively minor improvement would bump the version. Unless we then reprocess the whole pipeline with the updated hash, it would be tough to track down which pieces were processed with which pycytominer version.

I recommend adding this fix to a version 2 wishlist, and writing the consensus output with the proposed gzip solution in the consensus notebook.

Is there something I am overlooking here? What do you think?

shntnu · 2020-05-31T14:05:27Z

I updated my previous comment (looks like we were responding simultaneously :D).

My only additional note is that this PR can be part of Version 2; no need to produce the GCT file for Version 1 in any case. Admittedly the GCT and the CSV issue are distinct components, and this PR might be a little too bulky for our liking, but it's not terrible.

shntnu · 2020-05-31T14:10:00Z

@gwaygenomics I don't think I've fully absorbed #48 (comment). But given that this PR will go in version 2, is it OK if I delay looking into this further? Or is that blocking you on your immediate goals?

shntnu · 2021-03-09T12:33:52Z

From #48 (comment) @gwaygenomics said

I recommend ... writing the consensus output with the proposed gzip solution in the consensus notebook.

Although pycytominer now has the fix needed (cytomining/pycytominer#83 (comment)), I'll do what's suggested above to avoid bumping up the version of pycytominer.

shntnu · 2021-03-09T12:35:56Z

This PR will do 2 things:

~~Re-run build-consensus-signatures.ipynb for neatness~~
Add a bash script that gunzips all the .gz files in this repo and then runs gzip -n on each
Run the bash script

Bash script

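# Create gz,csv pairs: each .csv.gz path and the .csv path it decompresses to
find . -name "*.csv.gz" | while IFS= read -r gzip_file
do
  echo "${gzip_file},${gzip_file%.*}"
done > /tmp/rename.csv

# Re-zip each file without the stored name and timestamp.
# As written, this only prints the commands; remove both "echo"s to run them.
parallel -a /tmp/rename.csv -C "," -L 2 \
  "echo gunzip {1} && echo gzip -n {2}"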

gwaybio · 2021-03-09T16:26:11Z

a quick heads up @shntnu - I am running through the spherize operations and I noticed a couple things:

In the current pycytominer version in this repo, we still use the terms "blacklist" and "whiten" (we addressed this in Change whiten nomenclature to spherize cytomining/pycytominer#102)
There is no current way to alter epsilon in the spherize implementation of pycytominer.normalize.

What should we do? If we update pycytominer version, we'll likely need to reprocess everything. It'll also enable us to output gzip profiles without timestamps

shntnu · 2021-03-09T19:48:38Z

Add a bash script that gunzips all the .gz files in this repo and then runs gzip -n on each

This is not a good fix either, because it seems to introduce other, undesirable metadata (see metadata of new file)

Original file, i.e. before running the script:

file 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz

2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz: 
gzip compressed data, was "2016_04_01_a549_48hr_batch1_consensus_median.csv", 
last modified: Tue Mar  9 14:19:07 2021, 
max compression, 
original size modulo 2^32 157348070

New file, i.e., after running the script:

file 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz

2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz:
gzip compressed data, 
from Unix, 
original size modulo 2^32 157348070 gzip compressed data, 
unknown method, 
has CRC, 
extra field, 
has comment, 
encrypted, 
from FAT filesystem (MS-DOS, OS/2, NT), 
original size modulo 2^32 157348070

@gwaygenomics Can you run file on a .csv.gz file that was created using the updated output.py (cytomining/pycytominer#119) and paste the output here?

If that does not have this extra metadata, it wouldn't make sense for us to include this manual gzip -n fix. Instead, I will push the changes as is (without the fix) and then actually re-run it later.
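
The name, timestamp, and flag fields that `file` reports live in the fixed 10-byte gzip header (RFC 1952), so they can also be inspected directly. A minimal sketch follows; `read_gzip_header` is a hypothetical helper, not part of pycytominer, and it reads only the first gzip member's header rather than reproducing everything `file` prints.

```python
import struct


def read_gzip_header(path):
    """Return the timestamp, flag bits, and stored name from a gzip header."""
    with open(path, "rb") as handle:
        magic, method, flags, mtime, xfl, os_code = struct.unpack(
            "<2sBBIBB", handle.read(10)
        )
        if magic != b"\x1f\x8b":
            raise ValueError(f"{path} is not a gzip file")
        if flags & 0x04:  # FEXTRA: skip the length-prefixed extra field
            (extra_len,) = struct.unpack("<H", handle.read(2))
            handle.read(extra_len)
        name = None
        if flags & 0x08:  # FNAME: zero-terminated original file name
            raw_name = bytearray()
            while (byte := handle.read(1)) not in (b"", b"\x00"):
                raw_name += byte
            name = raw_name.decode("latin-1")
    return {"mtime": mtime, "flags": flags, "name": name, "os": os_code}
```

A file written without name or timestamp should come back with `mtime` equal to 0 and `name` equal to `None`.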

shntnu · 2021-03-09T19:49:36Z

Actually, come to think of it, there's no upside to including the fix, because our next version will have the updated pycytominer anyway. I'll ponder and ping again.

shntnu · 2021-03-09T20:25:50Z

There's literally no upside.

I verified that this command did not produce any output.

# copy newly created files to a new folder
cp -r 2016_04_01_a549_48hr_batch1 2016_04_01_a549_48hr_batch1_shsingh

# restore old files
git stash

# do a diff on gz contents
(ls 2016_04_01_a549_48hr_batch1/*.csv.gz) | parallel basename {} | parallel   "diff <(gzcat 2016_04_01_a549_48hr_batch1/{}) <(gzcat 2016_04_01_a549_48hr_batch1_shsingh/{})"

We are all set.

In a future version, we will re-run using the updated output.py. Until then, re-running the notebook (ipython scripts/nbconverted/build-consensus-signatures.py) will produce .csv.gz files that differ from the repo, because of the timestamp issue. The profiles module of this repo will have the same issue.

gwaybio

LGTM - my only question is: do we want to perform feature selection before creating the .gct?

shntnu · 2021-03-09T20:58:01Z

LGTM - my only question is: do we want to perform feature selection before creating the .gct?

Ah yes, we should! Will do

gwaybio · 2021-03-09T22:51:58Z

oh, @shntnu - I just realized you can also address #49 in this PR! It should be a simple fix to add MOA and target variables to the replicate_cols variable

shntnu · 2021-03-10T13:01:47Z

a quick heads up @shntnu - I am running through the spherize operations and I noticed a couple things:

In the current pycytominer version in this repo, we still use the terms "blacklist" and "whiten" (we addressed this in cytomining/pycytominer#102)

There is no current way to alter epsilon in the spherize implementation of pycytominer.normalize.

What should we do? If we update pycytominer version, we'll likely need to reprocess everything. It'll also enable us to output gzip profiles without timestamps

1. I'm not too worried about it because we've done all the right things to fix it in the library, and it will make it into this repo in the next version anyway.
2. Add epsilon as an option in normalize cytomining/pycytominer#128. If the default value of epsilon=1e-6 is fine, then we needn't fix that issue right now. How would we know whether it is fine or not? I suppose we can just do it very crudely and empirically for now: do the results improve similarly to what we've seen in past analyses by Ted et al.?

If they do, then epsilon=1e-6 is fine and there's nothing to be done here.
If they don't improve, then we need to think harder about the plan.
Update: I just noticed Adding spherized profiles #60 so you're all set to figure out whether there's anything to be done here.

3. gzip to me is the most annoying thing :D But we can live with it using this solution: Feature selected files available, with more row annotations included #48 (comment)

So in all, we don't need to bump the pycytominer version if 2. works out fine

shntnu · 2021-03-18T23:43:08Z

@gwaygenomics bumping to make sure you saw that it's back on your desk to review (no hurry from my end)

Edit: I just realized that you're working on the spherizing thingie, so we need to sort that out first. Ping me if it needs anything from me.

gwaybio

One suggestion, one small fix, and one question.

consensus/README.md

consensus/scripts/nbconverted/build-consensus-signatures.py

gwaybio

All set on my end, but I'll let you merge in case you want to make any additional adjustments.

.gitattributes

Co-authored-by: Greg Way <[email protected]>

Add GCT file, rerun notebook

0b38fae

shntnu linked an issue May 30, 2020 that may be closed by this pull request

Make consensus_modz file available in GCT format #47

Closed

shntnu changed the title ~~Add GCT file, rerun notebook~~ Consensus_modz file available in GCT format May 30, 2020

Handle gzip diffs

943bd53

Update .gitattributes

598e30c

Revert git diff because we no longer need that

gwaybio mentioned this pull request Jun 1, 2020

Add --no-name gzip flag to compression file output #50

Closed

gwaybio mentioned this pull request Mar 8, 2021

Adding batch 2 data #58

Merged

shntnu added 2 commits March 9, 2021 14:55

Update README and include pip

0ed2f59

typo

7661603

shntnu marked this pull request as ready for review March 9, 2021 20:25

shntnu requested a review from gwaybio March 9, 2021 20:25

gwaybio approved these changes Mar 9, 2021

shntnu added 3 commits March 9, 2021 22:15

docs

8769c38

add more columns, include feature selection

2409017

Add data files

907d031

shntnu changed the title ~~Consensus_modz file available in GCT format~~ Feature selected files available; consensus_modz file available in GCT format Mar 10, 2021

shntnu requested a review from gwaybio March 10, 2021 03:20

gwaybio approved these changes Mar 19, 2021

gwaybio mentioned this pull request Mar 19, 2021

Adding spherized profiles #60

Merged

3 tasks

Fix typos, drop GCT

4a0de4f

shntnu changed the title ~~Feature selected files available; consensus_modz file available in GCT format~~ Feature selected files available, with more row annotations included Mar 20, 2021

shntnu mentioned this pull request Mar 20, 2021

Make consensus_modz file available in GCT format #47

Closed

Merge branch 'master' into gct

7d5ba49

gwaybio approved these changes Mar 20, 2021

Update .gitattributes

7836c4f

Co-authored-by: Greg Way <[email protected]>

shntnu merged commit d471bbd into broadinstitute:master Mar 21, 2021

shntnu deleted the gct branch March 21, 2021 10:48
