Update python functions to adhere to simpler standards and pre-format data #246

sgosline · 2024-11-12T01:06:34Z

We discussed some basic alterations to the python functions that are described in the doc-update branch here:
https://github.com/PNNL-CompBio/coderdata/blob/doc-update/README.md

These changes entail:

renaming DatasetLoader to just dataset
creating functions to list, download, and load data for the CoderData package
creating methods on the new dataset object, including: train_test_validate (already exists), save, types, and format

This will resolve #228 and #229, which I will close as duplicates.


ymahlich · 2024-11-18T22:46:55Z

Clarification on `Dataset` class attributes

As per 2ec2c62 the Dataset object should (can?) contain:

transcriptomics
mutations
copy_numbers
proteomics
experiments
combinations
drugs
genes
samples

Currently the DataLoader object (to be refactored into the Dataset object) also contains:

mirna
methylation
metabolomics
full (? - I am guessing this is a leftover from the idea of creating one "master-table" containing all other information?)

It is missing:

combinations

Do we want to keep the additional attributes (and add combinations), or should we remove them and add them back as needed / when we have datasets that include those data types?

Existing data sets on figshare that currently don't get imported:

Data ingestion currently works as follows: the 'loader' checks whether the data_type descriptor in the file name is also an attribute of the DataLoader object. If it is, the loader imports the contents of the file and stores them in the object.

For example: assuming we downloaded all BeatAML files from figshare and load beataml (data = DataLoader('beataml')), the loader will find a file called beataml_drugs.tsv.gz. The loader then extracts from that file name that the data_type should be drugs, "sees" that there is a drugs attribute in DataLoader, and therefore imports the data file and stores it in DataLoader.drugs.
The problem is that, as of v.0.1.4, we also have files like beataml_drug_descriptors.tsv.gz, which should be imported into DataLoader.drug_descriptors (I assume). That attribute doesn't exist as of now, so the file is NOT imported. Is this an oversight? Was that something that @jjacobson95 was planning on implementing but hadn't gotten to?
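To make the matching behaviour above concrete, here is a minimal sketch of the file-name check. It is not the actual CoderData loader code: the class `DataLoaderSketch` and the helpers `data_type_from_filename` and `match_files` are hypothetical names, the sketch assumes file names follow the `[dataset]_[data_type].tsv.gz` pattern described above, and it only decides which files would be imported rather than reading them.

```python
class DataLoaderSketch:
    """Hypothetical stand-in for the loader object; only a few attributes shown."""

    def __init__(self):
        self.drugs = None
        self.samples = None
        self.experiments = None


def data_type_from_filename(filename, dataset):
    """Return 'drugs' for 'beataml_drugs.tsv.gz', or None if the prefix differs."""
    prefix = dataset + "_"
    if not filename.startswith(prefix):
        return None
    # Drop the dataset prefix, then everything from the first '.' onwards.
    return filename[len(prefix):].split(".", 1)[0]


def match_files(loader, dataset, filenames):
    """Split file names into those with a matching attribute and those without."""
    loaded, skipped = {}, []
    for filename in filenames:
        data_type = data_type_from_filename(filename, dataset)
        if data_type is not None and hasattr(loader, data_type):
            loaded[data_type] = filename
        else:
            skipped.append(filename)
    return loaded, skipped


loaded, skipped = match_files(
    DataLoaderSketch(),
    "beataml",
    ["beataml_drugs.tsv.gz", "beataml_drug_descriptors.tsv.gz"],
)
print(loaded)   # files that would be imported
print(skipped)  # files silently ignored because no attribute matches
```

Under these assumptions the drugs file is picked up and the drug descriptors file ends up in `skipped`, which is the behaviour described in this comment.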

sgosline · 2024-11-19T00:13:11Z

A few questions:

1- On what to import, I'd say we keep everything (except for full) and add in combinations, BUT only allow people to download what is available.
2- I'm not sure what to do about drug_descriptors. I feel like they should be loaded with the drugs, but I'm open to adding another argument to the loader.

ymahlich · 2024-11-19T00:48:04Z

Just for clarification:

Download (currently) is handled via the command line (i.e. > coderdata download [--prefix NAME]) which downloads everything on figshare (or the subset that shares the defined --prefix). I will be implementing a way to download via the API as well - I haven't looked into the downloader code of @jjacobson95 yet but I am assuming that I will be able to repurpose a lot and then just wrap that function for the CLI.
Do you also want to be able to retrieve only specific data_type(s)? E.g. cd.download(directory=cwd, prefix='all', data_type='all') and > coderdata download [--prefix NAME] [--data_type DTYPE] for the API call and CLI respectively, where data_type / --data_type would be used to define that we only want, say, samples.
The "simplest" thing to do is to just add a drug_descriptors attribute to Dataset, which would then automatically be populated during Dataset.load if a [dataset]_drug_descriptors.tsv.gz is available.
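As an illustration of the proposed filtering, the sketch below selects file names by dataset prefix and data type, with 'all' as the default for both. It is not the existing downloader: `select_files` is a hypothetical helper, it works on a plain list of file names instead of querying figshare, and it assumes the dataset name itself contains no underscore.

```python
def select_files(available, prefix="all", data_type="all"):
    """Filter file names of the form [dataset]_[data_type].tsv.gz."""
    selected = []
    for name in available:
        stem = name.split(".", 1)[0]              # e.g. 'beataml_drugs'
        dataset, _, dtype = stem.partition("_")   # split at the first underscore
        if prefix != "all" and dataset != prefix:
            continue
        if data_type != "all" and dtype != data_type:
            continue
        selected.append(name)
    return selected


# 'other' is a made-up dataset name used only for this example.
files = [
    "beataml_drugs.tsv.gz",
    "beataml_samples.tsv.gz",
    "other_samples.tsv.gz",
]
print(select_files(files, data_type="samples"))
print(select_files(files, prefix="beataml"))
```

The same function could back both the API call and the CLI, with the CLI only translating its flags into the two keyword arguments.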

sgosline · 2024-11-19T00:55:53Z

First off: please remove the prefix argument. It's hard to interpret and also not specific. Please use `dataset` to describe the dataset.

You can definitely add the data_type argument if you choose, but default to all.
Can you identify a use case in which someone would want the drug information without the descriptors? If not, just download them both at once.

sgosline assigned ymahlich Nov 12, 2024

This was referenced Nov 12, 2024

Dataset consistency #229

Closed

Dataset description #228

Closed

Documentation update: Update instructions on Installing data #222

Closed

ymahlich added the enhancement New feature or request label Nov 19, 2024

ymahlich added this to CoderData Nov 19, 2024

ymahlich moved this to In progress in CoderData Nov 19, 2024

ymahlich linked a pull request Dec 16, 2024 that will close this issue

246 update python functions to adhere to simpler standards and pre format data #269

Open
