Update python functions to adhere to simpler standards and pre-format data #246

Open
sgosline opened this issue Nov 12, 2024 · 4 comments · May be fixed by #269
Labels
enhancement New feature or request

Comments

@sgosline
Member

We discussed some basic alterations to the Python functions, which are described in the doc-update branch here:
https://github.com/PNNL-CompBio/coderdata/blob/doc-update/README.md

These changes entail:

  1. renaming DatasetLoader to just dataset
  2. creating functions to list, download, and load for the CoderData package
  3. creating functions on the new dataset object, including: train_test_validate (already exists), save, types, and format

This will resolve #228 and #229, which I will close as duplicates.

@ymahlich
Collaborator

Clarification on Dataset class attributes

As per 2ec2c62 the Dataset object should (can?) contain:

  • transcriptomics
  • mutations
  • copy_numbers
  • proteomics
  • experiments
  • combinations
  • drugs
  • genes
  • samples

Currently the DataLoader object (to be refactored into the Dataset object) also contains:

  • mirna
  • methylation
  • metabolomics
  • full (? - I am guessing this is a leftover from the idea of creating one "master-table" containing all other information?)

It is missing:

  • combinations

Do we want to keep the additional attributes (and add combinations), or should we remove them and add them back as needed / when we have datasets that include those data types?

Existing data sets on figshare that currently don't get imported:

Data ingestion is currently implemented as follows: the 'loader' checks whether the data_type descriptor in the file name is also an attribute of the DataLoader object. If it is, the loader imports the contents of the file and stores them in the object.

For example: assuming we downloaded all BeatAML files from figshare and load beataml (data = DataLoader('beataml')), the loader will find a file called beataml_drugs.tsv.gz. The loader then extracts from that file name that the data_type should be drugs, "sees" that there is a drugs attribute in DataLoader, and therefore imports the data file and stores it in DataLoader.drugs.
The problem is that as of v.0.1.4 we also have files like beataml_drug_descriptors.tsv.gz, which should be imported into DataLoader.drug_descriptor (I assume). That attribute doesn't exist as of now, so the file is NOT imported. Is this an oversight? Was that something that @jjacobson95 was planning on implementing but hadn't gotten to?
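
To make the matching behaviour described above concrete, here is a minimal sketch of the idea. It is not the actual coderdata loader code: the function names (data_type_from_filename, files_to_import), the KNOWN_DATA_TYPES set, and the assumption that file names always follow '<dataset>_<data_type>.<extension>' are all hypothetical and only serve to illustrate why a file without a matching attribute gets skipped.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Attribute names copied from the lists in this comment (assumption:
# the real DataLoader class may define a different set).
KNOWN_DATA_TYPES = {
    "transcriptomics", "mutations", "copy_numbers", "proteomics",
    "experiments", "drugs", "genes", "samples",
    "mirna", "methylation", "metabolomics", "full",
}


def data_type_from_filename(filename: str, dataset: str) -> Optional[str]:
    """Return the data_type in '<dataset>_<data_type>.<ext>', or None.

    Hypothetical helper: assumes the dataset name is a plain prefix
    followed by an underscore.
    """
    prefix = dataset + "_"
    if not filename.startswith(prefix):
        return None
    # Drop the dataset prefix, then everything from the first '.' onwards.
    return filename[len(prefix):].split(".", 1)[0]


def files_to_import(
    filenames: Iterable[str],
    dataset: str,
    known: Iterable[str] = KNOWN_DATA_TYPES,
) -> Tuple[Dict[str, str], List[str]]:
    """Split file names into those matching a known attribute and the rest."""
    known = set(known)
    imported: Dict[str, str] = {}
    skipped: List[str] = []
    for name in filenames:
        dtype = data_type_from_filename(name, dataset)
        if dtype in known:
            imported[dtype] = name
        else:
            # No attribute with this name exists, so the file is ignored.
            skipped.append(name)
    return imported, skipped
```

With this sketch, files_to_import(["beataml_drugs.tsv.gz", "beataml_drug_descriptors.tsv.gz"], "beataml") puts the first file under drugs and leaves the second in the skipped list, because the data_type parsed from its name (drug_descriptors) has no matching attribute. Adding that name to the known set would be enough for it to be picked up.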

@ymahlich ymahlich added the enhancement New feature or request label Nov 19, 2024
@ymahlich ymahlich moved this to In progress in CoderData Nov 19, 2024
@sgosline
Member Author

A few questions:

1- On what to import, I'd say we keep everything (except for full) and add in combinations, BUT only allow people to download what is available.
2- I'm not sure what to do about drug_descriptors. I feel like they should be loaded with the drugs, but I'm open to adding another argument to the loader.

@ymahlich
Collaborator

Just for clarification:

  1. Download (currently) is handled via the command line (i.e. > coderdata download [--prefix NAME]) which downloads everything on figshare (or the subset that shares the defined --prefix). I will be implementing a way to download via the API as well - I haven't looked into the downloader code of @jjacobson95 yet but I am assuming that I will be able to repurpose a lot and then just wrap that function for the CLI.
    Do you also want to be able to retrieve only specific data_type(s)? E.g. cd.download(directory=cwd, prefix='all', data_type='all') and > coderdata download [--prefix NAME] [--data_type DTYPE] for the API call and CLI respectively, where data_type / --data_type would be used to define that we only want, let's say, samples.
  2. The "simplest" thing to do is to just add a drug_descriptors attribute to Dataset, which would then automatically be populated during Dataset.load if a [dataset]_drug_descriptors.tsv.gz is available.

@sgosline
Member Author

First off: please remove the prefix argument. It's hard to interpret and also not specific. Please use the dataset to describe the dataset.

  1. You can definitely add the data_type argument if you choose, but default to all.
  2. Can you identify the use case in which someone would want the drug information without the descriptors? If not, just download them both at once.
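
As a rough illustration of the two points above, the file selection could look something like the sketch below. This is only one reading of the request, not the coderdata implementation: select_files is a hypothetical helper, the argument name dataset (replacing prefix) is an assumption, and the file names used with it are made up apart from the two beataml examples mentioned earlier in the thread. Both arguments default to 'all'.

```python
from typing import Iterable, List


def select_files(
    available: Iterable[str],
    dataset: str = "all",
    data_type: str = "all",
) -> List[str]:
    """Filter file names of the assumed form '<dataset>_<data_type>.<ext>'.

    Hypothetical helper: 'all' disables the corresponding filter.
    """
    selected = []
    for name in available:
        # Strip the extension(s): 'beataml_drugs.tsv.gz' -> 'beataml_drugs'.
        stem = name.split(".", 1)[0]
        if dataset != "all" and not stem.startswith(dataset + "_"):
            continue
        # Simplification: matches on the trailing '_<data_type>' only.
        if data_type != "all" and not stem.endswith("_" + data_type):
            continue
        selected.append(name)
    return selected
```

Calling select_files(files) with no further arguments returns everything, while select_files(files, dataset="beataml", data_type="samples") narrows the list to the one matching file. Whether requesting drugs should also pull in the drug descriptors is left open here, since that is the question being asked in point 2.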
