246 update python functions to adhere to simpler standards and pre format data #269

Open, wants to merge 24 commits into main

ymahlich (Collaborator)

Addresses all points raised in #246

Refactor / rewrite of DatasetLoader into Dataset

cd.Dataset provides the following methods:

  • dataset.format(data_type, ...): returns a formatted version of the requested data type. Currently supported data types: 'transcriptomics', 'mutations', 'copy_number', 'proteomics', 'experiments', 'combinations', 'drug_descriptor', 'drugs', 'genes', 'samples'
  • dataset.train_test_validate(args): splits the dataset into training, test, and validation sets and returns a Split object (a @dataclass) containing all three
  • dataset.types(): returns a list of data types present in the dataset
  • dataset.save(): saves the dataset object into a pickle file
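
To illustrate the shape of the return value described for train_test_validate, here is a minimal standalone sketch. It does not use coderdata and is not the PR's implementation: the field names of `Split`, the `ratio` and `seed` parameters, and the plain shuffle-and-cut strategy are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Split:
    # Field names are assumed; the PR only states that the object
    # holds the training, test, and validation sets.
    train: list
    test: list
    validate: list


def train_test_validate(records, ratio=(8, 1, 1), seed=None):
    """Shuffle records and cut them into three parts according to ratio."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    total = sum(ratio)
    n_train = len(shuffled) * ratio[0] // total
    n_test = len(shuffled) * ratio[1] // total
    return Split(
        train=shuffled[:n_train],
        test=shuffled[n_train:n_train + n_test],
        # Whatever remains after the first two cuts goes to validation.
        validate=shuffled[n_train + n_test:],
    )


split = train_test_validate(range(100), seed=0)
print(len(split.train), len(split.test), len(split.validate))  # 80 10 10
```

Returning a dataclass rather than a bare tuple lets callers access each part by name (`split.train`) instead of by position.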

Rewrite / addition of functions in coderdata:

  • cd.download(name, ...): refactor of cd.download_by_prefix that also accepts local_path and exist_ok arguments. local_path sets the directory the files are downloaded into, and exist_ok controls whether files that already exist are overwritten.
  • cd.load(name, ...): returns a cd.Dataset object for the dataset name given as argument. Also accepts the parameters directory (the directory containing the data files to load) and from_pickle (whether to load a pickled Dataset, e.g. one written by Dataset.save()).
  • cd.list(): returns a list of available datasets (based on datasets.yml in the root directory of coderdata; this may change to a YAML file stored on figshare in future builds)
  • cd.version(): returns version strings of the package and dataset
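
The local_path / exist_ok behaviour described for cd.download can be sketched without any network access. The helper below is hypothetical (`save_file` is not a coderdata function, and the real download logic may differ); it only shows one plausible way an overwrite guard of this kind works.

```python
import tempfile
from pathlib import Path


def save_file(filename, payload, local_path=".", exist_ok=False):
    """Write payload into local_path, refusing to overwrite unless exist_ok."""
    target_dir = Path(local_path)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    if target.exists() and not exist_ok:
        raise FileExistsError(
            f"{target} already exists; pass exist_ok=True to overwrite"
        )
    target.write_bytes(payload)
    return target


with tempfile.TemporaryDirectory() as tmp:
    first = save_file("example.csv", b"a,b\n", local_path=tmp)
    try:
        # Second write to the same path is refused by default.
        save_file("example.csv", b"c,d\n", local_path=tmp)
    except FileExistsError:
        refused = True
    else:
        refused = False
    # With exist_ok=True the existing file is replaced.
    save_file("example.csv", b"c,d\n", local_path=tmp, exist_ok=True)
    final_content = first.read_bytes()
```

Defaulting exist_ok to False in this sketch is a design assumption: it makes accidental overwrites an explicit error rather than a silent data change.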

…in `cd.Dataset.train_test_validate()` but can also be called as a standalone function (`cd.train_test_validate()`)
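
One common way to offer the same operation both as a method and as a standalone function is to have the method delegate to the module-level function. The sketch below assumes that pattern; the class body, the `records` attribute, and the `n_test` / `n_validate` parameters are hypothetical and not taken from the PR.

```python
def train_test_validate(records, n_test=1, n_validate=1):
    """Standalone function: cut records into train, test, and validation parts."""
    records = list(records)
    n_train = len(records) - n_test - n_validate
    return (
        records[:n_train],
        records[n_train:n_train + n_test],
        records[n_train + n_test:],
    )


class Dataset:
    def __init__(self, records):
        self.records = list(records)

    def train_test_validate(self, **kwargs):
        # The method only forwards to the standalone function,
        # so both entry points share a single implementation.
        return train_test_validate(self.records, **kwargs)


via_method = Dataset(range(10)).train_test_validate()
via_function = train_test_validate(range(10))
```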

Successfully merging this pull request may close these issues.

Update python functions to adhere to simpler standards and pre-format data