Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cell Line Datasets Separated, Improved Error Handling for all Build Scripts #237

Merged
merged 33 commits into from
Nov 12, 2024

Conversation

jjacobson95
Copy link
Collaborator

@jjacobson95 jjacobson95 commented Oct 21, 2024

Ready to merge.

Multiple Changes will be coming with this PR.

Summary:

@sgosline
Copy link
Member

Is there a way to generalize the validation script so we only need one for all datasets? It feels like this is redundant and increases the burden on anyone wanting to add a dataset. Since all the files have the same file name structure ([datasetname]_[datatype]), can we just iterate through a directory and validate each one according to the file name?

@jjacobson95
Copy link
Collaborator Author

Yes, that is a more maintainable approach, I'll make that change!

Likely will use a decent amount of hard coded file names to ensure that all expected files are present (as is currently done) rather than simple pattern matching.

@sgosline
Copy link
Member

I'd avoid hard-coded files. Perhaps use a config file that can be updated so the validation code itself remains the same.

@jjacobson95
Copy link
Collaborator Author

Good call, that sounds more elegant

@sgosline
Copy link
Member

sgosline commented Nov 2, 2024

Please merge #242 before this.

@jjacobson95
Copy link
Collaborator Author

Merged. I've got one bug that I'm working on (The broad_sanger data split is working in the individual dataset script, but not in the build_all script) and then I'll rebuild all of the data.

@jjacobson95
Copy link
Collaborator Author

Just a note, I'm running into consistent memory issues when splitting the broad_sanger data in docker on AWS. I'm reworking this script to reduce the memory usage.

@jjacobson95
Copy link
Collaborator Author

Resolves #227. This is the most notable update.
Resolves #233, #234, #236 (with a schema fix to allow "Any"), #238, and #240. These are all bug and schema updates.
-We never made this an issue, but this moves all of the schema checking scripts into a single unified script that runs schema checking in parallel. Old schema code has been removed.

Closes #239. Not because this was resolved, but because it is a non-issue.

@jjacobson95
Copy link
Collaborator Author

Ready to merge!

@sgosline
Copy link
Member

I reviewed this, and it looks solid!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants