Cell Line Datasets Separated, Improved Error Handling for all Build Scripts #237

jjacobson95 · 2024-10-21T19:37:48Z

Ready to merge.

This PR brings multiple changes.

Summary:

Broad_sanger is separated into all of its component datasets. These include:
- CCLE
- gCSI
- GDSCv1
- GDSCv2
- CTRPv2
- FIMM
- PRISM
This separation process includes several steps, listed here, to address "Break down broad_sanger datasets into individual datasets with no duplicates" (#227):
- build/broad_sanger/05_separate_datasets.py - This script performs the separation.
- build/broad_sanger/build_misc.sh - This calls the separation script in the broad_sanger_omics dockerfile.
- build/build_all.py and build/build_dataset.py were modified to have a process_misc step. This is currently called only for broad_sanger, but future datasets may also require an additional final processing step.
- Validation and schema tests for each listed dataset, with associated updates to check_all_schemas.py.
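
As a rough illustration of the separation step (not the actual 05_separate_datasets.py), the sketch below groups rows by study and drops exact duplicates within each study. The `study` column name and the dict-based row format are assumptions made for this example.

```python
from collections import defaultdict

# Component datasets named in this PR.
STUDIES = ["CCLE", "gCSI", "GDSCv1", "GDSCv2", "CTRPv2", "FIMM", "PRISM"]


def split_by_study(rows, study_key="study"):
    """Group rows by study, dropping exact duplicate rows within each study.

    `rows` is an iterable of dicts with hashable values;
    `study_key` is a hypothetical column name.
    """
    seen = defaultdict(set)
    out = {study: [] for study in STUDIES}
    for row in rows:
        study = row.get(study_key)
        if study not in out:
            continue  # skip rows whose study is not in the list above
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen[study]:
            continue  # exact duplicate already kept for this study
        seen[study].add(fingerprint)
        out[study].append(row)
    return out
```

Each study's list can then be written to its own set of files. This version keeps every row in memory, so it only suits small inputs.
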
Error Handling improved. Every bash script in the build process has been modified to appropriately pass an exit code to docker when its R or Python scripts fail.
- This will address the uncaught validation error in "Validation ending with a success message even when failing" (#233).
- It may also help with "Drug generation Uncaught Failures" (#125), though we need a more direct method to handle that issue.
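
The same fail-fast idea can be sketched in Python. This is a minimal illustration, not the bash changes made in this PR: `run_steps` is a hypothetical helper that runs each command in order and returns the first non-zero exit code so the caller can hand it on.

```python
import subprocess
import sys


def run_steps(commands):
    """Run each command in order; return the first non-zero exit code, else 0."""
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # Report the failure and stop so later steps never run.
            print(f"step failed ({result.returncode}): {cmd}", file=sys.stderr)
            return result.returncode
    return 0
```

A wrapper script would end with `sys.exit(run_steps(steps))`, so the calling shell, and in turn docker, sees the failing step's exit code instead of a success.
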
Schema Checking revamped. All of the schema checking code has been moved into three files: schema/coderdata.yaml, schema/expected_files.yaml, scripts/check_schema.py.
Multiple bug fixes and code updates to align with the schema. These include:
- Validation ending with a success message even when failing (#233)
- Should the schema be modified to allow "inf" values in dose_response_value? (#234)
- NCI60 - Several incorrect values in chem_name (#236)
- NCI60 Polars Version Error (#238)
- Add more robustness in Synapse file download scripts (#239)
- Schema Issues Found (#240)

sgosline · 2024-10-22T15:22:26Z

Is there a way to generalize the validation script so we only need one for all datasets? It feels like this is redundant and increases the burden on anyone wanting to add a dataset. Since all the files have the same file name structure ([datasetname]_[datatype]), can we just iterate through a directory and validate each one according to the file name?

jjacobson95 · 2024-10-22T15:36:42Z

Yes, that is a more maintainable approach, I'll make that change!

I'll likely use a fair number of hard-coded file names to ensure that all expected files are present (as is currently done), rather than simple pattern matching.

sgosline · 2024-10-22T15:41:26Z

I'd avoid hard-coded files. Perhaps use a config file that can be updated so the validation code itself remains the same.

jjacobson95 · 2024-10-22T15:45:00Z

Good call, that sounds more elegant
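
A minimal sketch of the config-driven idea discussed above, assuming files follow the [datasetname]_[datatype] naming. The `EXPECTED` dict stands in for a config file, and its dataset and data-type names are illustrative only.

```python
from pathlib import Path

# Stand-in for a config file; these names are illustrative assumptions.
EXPECTED = {
    "ccle": ["samples", "drugs", "experiments"],
    "gcsi": ["samples", "drugs"],
}


def find_missing(directory, expected=EXPECTED):
    """Return expected '[datasetname]_[datatype]' stems not found in `directory`."""
    # Strip every extension, so "ccle_drugs.tsv.gz" becomes "ccle_drugs".
    present = {p.name.split(".")[0] for p in Path(directory).iterdir() if p.is_file()}
    missing = []
    for dataset, datatypes in expected.items():
        for datatype in datatypes:
            stem = f"{dataset}_{datatype}"
            if stem not in present:
                missing.append(stem)
    return missing
```

Adding a dataset then means editing the config only, while the validation code stays the same.
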

sgosline · 2024-11-02T18:38:11Z

Please merge #242 before this.

jjacobson95 · 2024-11-05T16:41:00Z

Merged. I've got one bug that I'm working on (the broad_sanger data split works in the individual dataset script, but not in the build_all script), and then I'll rebuild all of the data.

jjacobson95 · 2024-11-06T17:27:57Z

Just a note, I'm running into consistent memory issues when splitting the broad_sanger data in docker on AWS. I'm reworking this script to reduce the memory usage.
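
One common way to bound memory in a split like this is to stream the input row by row rather than load it whole. The sketch below shows that technique with the standard library. It is not the reworked script (which uses polars), and the `study` column and output file naming are assumptions.

```python
import csv
from contextlib import ExitStack
from pathlib import Path


def stream_split(src, out_dir, study_key="study"):
    """Split a CSV by `study_key`, holding only one row in memory at a time.

    Returns the number of rows written per study. The column name and
    output file naming are hypothetical.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    counts = {}
    writers = {}
    with ExitStack() as stack, open(src, newline="") as fh:
        reader = csv.DictReader(fh)
        for row in reader:
            study = row[study_key]
            if study not in writers:
                # Open one output file per study the first time it is seen.
                out = stack.enter_context(
                    open(out_dir / f"{study}_experiments.csv", "w", newline="")
                )
                writers[study] = csv.DictWriter(out, fieldnames=reader.fieldnames)
                writers[study].writeheader()
                counts[study] = 0
            writers[study].writerow(row)
            counts[study] += 1
    return counts
```

Memory use here depends on the number of studies, not the number of rows. Removing duplicates would need extra state and is left out of this sketch.
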

jjacobson95 · 2024-11-11T21:17:11Z

Resolves #227. This is the most notable update.
Resolves #233, #234, #236 (with a schema fix to allow "Any"), #238, and #240. These are all bug and schema updates.
- We never opened an issue for this, but this PR also moves all of the schema checking scripts into a single unified script that runs schema checks in parallel. The old schema code has been removed.
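
A minimal sketch of running schema checks in parallel, assuming one check per dataset. `check_one` is a hypothetical callable and does not reflect the interface of scripts/check_schema.py.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(datasets, check_one, max_workers=4):
    """Run `check_one(dataset)` for every dataset in parallel.

    `check_one` is a hypothetical callable returning True on success.
    Returns the sorted list of datasets that failed or raised.
    """
    def safe_check(dataset):
        try:
            return dataset, bool(check_one(dataset))
        except Exception:
            # Treat an exception in a check as a failed check.
            return dataset, False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(safe_check, datasets))
    return sorted(name for name, ok in results if not ok)
```

Exiting with a non-zero status whenever the returned list is non-empty keeps a failed check from ending in a success message.
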

Closes #239. Not because this was resolved, but because it is a non-issue.

jjacobson95 · 2024-11-11T21:23:18Z

Ready to merge!

sgosline · 2024-11-12T00:35:02Z

I reviewed this, and it looks solid!

jjacobson95 added 5 commits October 18, 2024 12:31

Introduced splitting script. Updated build scripts. Added all new sch…

a7d0c8f

…ema checks. Now time to debug.

multiple schema check fixes

260dacf

Working on testing Seperating broad_sanger

3e5a564

Working on testing Seperating broad_sanger

2dc238c

Error traps and exit codes added. This should allow errors to stop co…

c661f7a

…de and propagate from python/R to bash to docker to local

switching from sh to bash to allow for error propagation

1eee16f

jjacobson95 and others added 16 commits October 22, 2024 10:03

Merge remote-tracking branch 'origin/main' into split_broad_sanger

649a05e

pinning polars in broad_sanger

c18a979

nci60 concat fix

39c0c6c

bug fix

d97f418

nci fix

c810335

fix

9ac6b0b

tiny bug fix

fd54b97

tiny bug fix 2

9f7520d

Tracked down issue causing beataml failure. ensembl moved/removed a file

419e389

working on mpnst fix. will add issue if this doesn't resolve it

a38702c

BeatAML schema fix

b443d16

BeatAML schema fix

895a1aa

Removed old code from beataml

bcb6bb4

allowed hill slope to be positive

b7c39b4

this will allow for increased range in curve fitting.

Testing New Schema Checker & Build Process

d0a8762

updated dose_response_value to any in schema

208e0ad

Merge remote-tracking branch 'origin/curve-fit-change' into split_bro…

3fd4b95

…ad_sanger

working on 05_separate_datasets.py bug

9ace774

EC2 Default User and others added 4 commits November 5, 2024 20:55

Merge remote-tracking branch 'refs/remotes/origin/split_broad_sanger'…

d1731c1

… into split_broad_sanger

Moving some changes from local to test on AWS. added memory debug

c28bd78

Still working on memory issues with broad_Sanger splitting

ace675d

Reverting version. Updated docker settings

9614770

jjacobson95 and others added 5 commits November 6, 2024 12:36

Full rework of broad_sanger splitter. Now uses polars and optimized f…

ec06b7d

…or memory

Schema update

640f949

Final update

8efb440

last update

20b3d36

Removed old schema checking code.

85ed415

sgosline merged commit 721b24b into main Nov 12, 2024

This was referenced Nov 12, 2024

NCI60 - Several incorrect values in chem_name #236

Open

Documentation update: Add which Datasets are contained within broad_sanger. #221

Closed
