Cell Line Datasets Separated, Improved Error Handling for all Build Scripts #237
…ema checks. Now time to debug.
…de and propagate from python/R to bash to docker to local
Is there a way to generalize the validation script so we only need one for all datasets? The current setup feels redundant and increases the burden on anyone wanting to add a dataset. Since all the files have the same file name structure ([datasetname]_[datatype]), can we just iterate through a directory and validate each one according to its file name?
Yes, that is a more maintainable approach, I'll make that change! I will likely use a decent number of hard-coded file names to ensure that all expected files are present (as is currently done) rather than simple pattern matching.
I'd avoid hard-coded files. Perhaps use a config file that can be updated so the validation code itself remains the same.
Good call, that sounds more elegant.
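As a rough illustration of the approach discussed above, here is a minimal sketch of a single validator that walks a directory and picks its checks from the `[datasetname]_[datatype]` file name. The `REQUIRED_COLUMNS` mapping, the data type names, the `.csv` suffix, and the function names are all assumptions made for illustration, not the project's actual config or code; in practice the mapping could be loaded from a config file so the code never needs to change.

```python
import csv
from pathlib import Path

# Hypothetical config: data type -> columns that file must contain.
# The keys and column lists are illustrative only.
REQUIRED_COLUMNS = {
    "drugs": ["chem_name"],
    "experiments": ["dose_response_value"],
}


def split_name(path):
    """Split '[datasetname]_[datatype].csv' into (dataset, datatype).

    Splits on the LAST underscore, so dataset names such as
    'broad_sanger' survive; this assumes data types contain no underscore.
    """
    dataset, _, datatype = path.stem.rpartition("_")
    return dataset, datatype


def validate_directory(directory, required_columns):
    """Return a list of error strings; an empty list means every file passed."""
    errors = []
    for path in sorted(Path(directory).glob("*.csv")):
        dataset, datatype = split_name(path)
        if not dataset or datatype not in required_columns:
            errors.append(f"{path.name}: unrecognized data type '{datatype}'")
            continue
        # Only the header row is read here; real checks would go further.
        with path.open(newline="") as handle:
            header = next(csv.reader(handle), [])
        missing = [c for c in required_columns[datatype] if c not in header]
        if missing:
            errors.append(f"{path.name}: missing columns {missing}")
    return errors
```

Adding a dataset then only means dropping files into the directory, and adding a data type only means editing the config.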
this will allow for increased range in curve fitting.
Please merge #242 before this.
Merged. I've got one bug that I'm working on (the broad_sanger data split is working in the individual dataset script, but not in the build_all script) and then I'll rebuild all of the data.
Just a note, I'm running into consistent memory issues when splitting the broad_sanger data in docker on AWS. I'm reworking this script to reduce the memory usage.
Resolves #227. This is the most notable update. Closes #239. Not because this was resolved, but because it is a non-issue.
Ready to merge!
I reviewed this, and it looks solid!
Ready to merge.
Multiple Changes will be coming with this PR.
Summary:
Broad_sanger is separated into all of its component datasets. These include:
This separation process includes several steps, listed here, which address #227 (Break down broad_sanger datasets into individual datasets with no duplicates):
process_misc step. This is currently only called for broad_sanger, but future datasets may also require an additional final processing step.
Error Handling improved. Every bash script in the build process has been modified to appropriately pass an exit code to docker when its R or Python scripts fail.
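The PR implements this propagation in bash; as a hedged illustration of the same idea in Python, the sketch below runs a sequence of commands and returns the first nonzero exit code so the caller (and ultimately the container) can see the failure. The function name `run_steps` and the example commands are hypothetical and are not taken from the repository's build scripts.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each command in order; return the first nonzero exit code, else 0.

    Later steps are skipped once one fails, so a failure cannot be
    masked by a subsequent step succeeding.
    """
    for command in steps:
        result = subprocess.run(command)
        if result.returncode != 0:
            print(
                f"step failed with exit code {result.returncode}: {' '.join(command)}",
                file=sys.stderr,
            )
            return result.returncode
    return 0


# Example usage (hypothetical commands; the second one fails with code 3):
#   steps = [
#       [sys.executable, "-c", "print('step 1 ok')"],
#       [sys.executable, "-c", "import sys; sys.exit(3)"],
#   ]
#   sys.exit(run_steps(steps))
```

Ending the entry point with `sys.exit(run_steps(...))` is what lets the failure code reach whatever launched the process.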
Schema Checking revamped. All of the schema checking code has been moved into three files: schema/coderdata.yaml, schema/expected_files.yaml, scripts/check_schema.py.
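To illustrate how a config-driven presence check might look, here is a minimal sketch. The `EXPECTED` dictionary stands in for whatever schema/expected_files.yaml parses to; its structure, the data type names, the `.csv` suffix, and the `missing_files` helper are assumptions for illustration and may not match scripts/check_schema.py.

```python
from pathlib import Path

# Hypothetical stand-in for a parsed expected-files config:
# dataset name -> data types that must be present for that dataset.
EXPECTED = {
    "broad_sanger": ["drugs", "experiments"],
    "nci60": ["drugs"],
}


def missing_files(directory, expected, suffix=".csv"):
    """Return expected '[datasetname]_[datatype]' file names absent from directory."""
    present = {p.name for p in Path(directory).iterdir() if p.is_file()}
    missing = []
    for dataset, datatypes in expected.items():
        for datatype in datatypes:
            name = f"{dataset}_{datatype}{suffix}"
            if name not in present:
                missing.append(name)
    return missing
```

A caller could treat a non-empty result as a validation failure and exit with a nonzero code.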
Multiple bug fixes and code updates to align to schema. These include #233 (Validation ending with a success message even when failing), #234 (Should the schema be modified to allow "inf" values in dose_response_value?), #236 (NCI60 - Several incorrect values in chem_name), #238 (NCI60 Polars Version Error), #239 (Add more robustness in Synapse file download scripts), and #240 (Schema Issues Found).