246 update python functions to adhere to simpler standards and pre format data #269

ymahlich · 2024-12-16T20:00:16Z

Addresses all points raised in #246

Refactor / rewrite of `DatasetLoader` into `Dataset`

`cd.Dataset` provides the following methods:

`dataset.format(data_type, ...)`: returns a formatted version of the data type of interest. Currently supported: 'transcriptomics', 'mutations', 'copy_number', 'proteomics', 'experiments', 'combinations', 'drug_descriptor', 'drugs', 'genes', 'samples'.
`dataset.train_test_validate(args)`: splits the dataset into train, test & validation sets and returns a `@dataclass` `Split` object containing all three datasets.
`dataset.types()`: returns a list of the data types present in the dataset.
`dataset.save()`: saves the dataset object to a pickle file.
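To illustrate the shape of the `Split` return value and of `types()`, here is a minimal, self-contained sketch. It does not use coderdata and is not the implementation from this PR: `MiniDataset`, its attributes, the `ratio` and `seed` parameters, and the plain random shuffle are all assumptions made for the example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Split:
    """Container holding the three partitions of a split."""
    train: list
    test: list
    validate: list


@dataclass
class MiniDataset:
    """Hypothetical stand-in for a dataset object; names are illustrative."""
    samples: list = field(default_factory=list)
    experiments: list = field(default_factory=list)
    drugs: list = field(default_factory=list)

    def types(self):
        # Report only the data types that actually hold records.
        return [name for name, value in vars(self).items() if value]

    def train_test_validate(self, ratio=(8, 1, 1), seed=None):
        # Shuffle a copy of the experiment records, then cut it by ratio.
        records = list(self.experiments)
        random.Random(seed).shuffle(records)
        total = sum(ratio)
        n_train = len(records) * ratio[0] // total
        n_test = len(records) * ratio[1] // total
        return Split(
            train=records[:n_train],
            test=records[n_train:n_train + n_test],
            validate=records[n_train + n_test:],  # remainder goes to validation
        )


dataset = MiniDataset(samples=["s1", "s2"], experiments=list(range(20)))
present_types = dataset.types()
split = dataset.train_test_validate(seed=0)
```

Returning a dataclass instead of a tuple lets callers access `split.train`, `split.test` and `split.validate` by name rather than by position.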

Rewrite / addition of functions in `coderdata`:

`cd.download(name, ...)`: refactor of `cd.download_by_prefix`. It additionally accepts the arguments `local_path` (the directory the files are downloaded into) and `exist_ok` (whether files that already exist should be overwritten).
`cd.load(name, ...)`: returns a `cd.Dataset` object for the dataset name given as argument. It also accepts the parameters `directory` (the directory containing the data files to be loaded) and `from_pickle` (whether a pickled Dataset, e.g. one written by `Dataset.save()`, should be loaded).
`cd.list()`: returns a list of available datasets, based on `datasets.yml` in the root directory of coderdata. In future builds this may change to a YAML file stored on figshare.
`cd.version()`: returns the version strings of the package and the dataset.
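The `local_path` / `exist_ok` behaviour described for `cd.download` can be pictured with the following standalone sketch. It performs no network access and is not coderdata code: `store_file` is a hypothetical helper, and raising `FileExistsError` when `exist_ok` is false is an assumption; the actual function may instead skip the file or emit a warning.

```python
import tempfile
from pathlib import Path


def store_file(name, payload, local_path=None, exist_ok=False):
    """Write payload to local_path/name, honoring an exist_ok flag.

    Hypothetical helper for illustration; not part of coderdata.
    """
    target_dir = Path(local_path) if local_path is not None else Path.cwd()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    if target.exists() and not exist_ok:
        # Assumed behaviour: refuse to overwrite unless explicitly allowed.
        raise FileExistsError(f"{target} exists; pass exist_ok=True to overwrite")
    target.write_bytes(payload)
    return target


with tempfile.TemporaryDirectory() as tmp:
    first = store_file("demo_samples.csv", b"v1", local_path=tmp)
    try:
        store_file("demo_samples.csv", b"v2", local_path=tmp)
        refused = False
    except FileExistsError:
        refused = True
    # With exist_ok=True the existing file is replaced.
    store_file("demo_samples.csv", b"v2", local_path=tmp, exist_ok=True)
    final_payload = first.read_bytes()
```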

ymahlich added 24 commits November 18, 2024 15:12

first draft of Dataset / DataLoader rewrite

b14d1c3

added Dataset.types() function

2781309

added train_test_validate as instance method

748e383

moved train_test_validate out of Dataset. Is now called from within `cd.Dataset.train_test_validate()` but can also be called as standalone function (`cd.train_test_validate()`)

8995b10

added dataset.save() function

f689d00

added option to load from pickled object file to dataset.load() function

b822acd

added skeleton for dataset.format()

37d5f90

added "mutations" data_type to dataset.format()

e899ec9

added handling of 'combinations', 'drugs', 'genes' & 'samples' in `dataset.format()`

68a5719

added basic handling of 'proteomics' to dataset.format()

50e4ee2

added handling of 'transcriptomics' in dataset.format()

8ebb432

generalized error handling

64ce81a

added handling of 'experiments' in dataset.format()

101ad74

added handling of copy_number

a397281

added handling of drug_descriptor

6bd844a

changed format('mutations') to use pd.pivot_table instead of `pd.crosstab`

7c7342c

renamed download_by_prefix to download

8ab2d77

added option to download to specified folder in download()

8cba605

fixed import

7c0d4b2

added copy_number -> copy_call conversion

9184363

added utilization of __version__

7804cd6

added missing __init__ file

9862154

added helper function to list all available datasets

1b16b82

removed coderdata.DatasetLoader / coderdata.loader.*

28ea7e3

ymahlich requested review from sgosline and jjacobson95 December 16, 2024 20:00

ymahlich linked an issue Dec 16, 2024 that may be closed by this pull request

Update python functions to adhere to simpler standards and pre-format data #246

Open
