Merge pull request #226 from PNNL-CompBio/build_all_updates
Several Build Updates
sgosline authored Oct 14, 2024
2 parents 2a6dd85 + 19548a9 commit 1fc3e75
Showing 24 changed files with 1,156 additions and 584 deletions.
70 changes: 57 additions & 13 deletions build/README.md
@@ -12,30 +12,75 @@ are added.

 This script initializes all docker containers, builds all datasets, validates them, and uploads them to figshare and pypi.

-It requires the following authorization tokens to be set in the local environment depending on the use case:
-`SYNAPSE_AUTH_TOKEN`: Required for beataml and mpnst datasets. Join the [CoderData team](https://www.synapse.org/#!Team:3503472) on Synapse and generate an access token.
-`PYPI_TOKEN`: This token is required to upload to PyPI.
-`FIGSHARE_TOKEN`: This token is required to upload to Figshare.
+It requires the following authorization tokens to be set in the local environment depending on the use case:
+`SYNAPSE_AUTH_TOKEN`: Required for beataml and mpnst datasets. Join the [CoderData team](https://www.synapse.org/#!Team:3503472) on Synapse and generate an access token.
+`PYPI_TOKEN`: This token is required to upload to PyPI.
+`FIGSHARE_TOKEN`: This token is required to upload to Figshare.
+`GITHUB_TOKEN`: This token is required to upload to GitHub.

-Available arguments:
+**Available arguments**:

 - `--docker`: Initializes and builds all docker containers.
 - `--samples`: Processes and builds the sample data files.
 - `--omics`: Processes and builds the omics data files.
 - `--drugs`: Processes and builds the drug data files.
 - `--exp`: Processes and builds the experiment data files.
-- `--all`: Executes all available processes above (docker, samples, omics, drugs, exp).
-- `--validate`: Validates the generated datasets using the schema check scripts.
-- `--figshare`: Uploads the datasets to Figshare.
-- `--pypi`: Uploads the package to PyPI.
-- `--high_mem`: Utilizes high memory mode for concurrent data processing.
+- `--all`: Executes all available processes above (docker, samples, omics, drugs, exp). This does not run the validate, figshare, or pypi commands.
+- `--validate`: Validates the generated datasets using the schema check scripts. This is automatically included if data upload occurs.
+- `--figshare`: Uploads the datasets to Figshare. FIGSHARE_TOKEN must be set in local environment.
+- `--pypi`: Uploads the package to PyPI. PYPI_TOKEN must be set in local environment.
+- `--high_mem`: Utilizes high memory mode for concurrent data processing. This has been successfully tested using 32 or more vCPUs.
 - `--dataset`: Specifies the datasets to process (default='broad_sanger,hcmi,beataml,mpnst,cptac').
-- `--version`: Specifies the version number for the package and data upload title. This is required to upload to figshare and PyPI
+- `--version`: Specifies the version number for the PyPI package and Figshare upload title (e.g., "0.1.29"). This is required for figshare and PyPI upload steps. This must be a higher version than previously published versions.
+- `--github-username`: GitHub username matching the GITHUB_TOKEN. Required to push the new Tag to the GitHub Repository.
+- `--github-email`: GitHub email matching the GITHUB_TOKEN. Required to push the new Tag to the GitHub Repository.
+
+**Example usage**:
+- Build all datasets and upload to Figshare and PyPI and GitHub.
+Required tokens for the following command: `SYNAPSE_AUTH_TOKEN`, `PYPI_TOKEN`, `FIGSHARE_TOKEN`, `GITHUB_TOKEN`.
+```bash
+python build/build_all.py --all --high_mem --validate --pypi --figshare --version 0.1.41 --github-username jjacobson95 --github-email [email protected]
+```
+
+- Build only the experiment files.
+**Note**: Preceding steps will not automatically be run. This assumes that docker images, samples, omics, and drugs were all previously built. Ensure all required tokens are set.
+```bash
+python build/build_all.py --exp
+```

+## build_dataset.py script
+This script builds a single dataset for **debugging purposes only**. It can help determine if a dataset will build correctly in isolation. Note that the sample and drug identifiers generated may not align with those from other datasets, so this script is not suitable for building production datasets.
+
+It requires the following authorization tokens to be set in the local environment depending on the dataset:
+
+`SYNAPSE_AUTH_TOKEN`: Required for beataml and mpnst datasets. Follow the directions above to gain access.
+
+Available arguments:
+- `--dataset`: Required. Name of the dataset to build.
+- `--use_prev_dataset`: Optional. Prefix of the previous dataset for sample and drug ID continuation. The previous dataset files must be in the "local" directory.
+- `--validate`: Optional. Runs the schema checker on the built files.
+- `--continue`: Optional. Continues from where the build left off by skipping existing files in "local" directory.
 Example usage:

+Build the broad_sanger dataset:
 ```bash
-python build/build_all.py --all --high_mem --validate --pypi --figshare --version 0.1.29
+python build/build_dataset.py --dataset broad_sanger
 ```
+Build the mpnst dataset continuing from broad_sanger sample and drug IDs:
+```bash
+python build/build_dataset.py --dataset mpnst --use_prev_dataset broad_sanger
+```
+Build the hcmi dataset and run validation:
+```bash
+python build/build_dataset.py --dataset hcmi --validate
+```
+Build the broad_sanger dataset but skip previously built files in "local" directory:
+```bash
+python build/build_dataset.py --dataset broad_sanger --continue
+```
+



## Data Source Reference List

@@ -66,4 +111,3 @@ python build/build_all.py --all --high_mem --validate --pypi --figshare --versio
| BeatAML | NCI Proteomic Data Commons | Mapping the proteogenomic landscape enables prediction of drug response in acute myeloid leukemia | James Pino et al. | 23
| MPNST | NF Data Portal | Chromosome 8 gain is associated with high-grade transformation in MPNST | David P Nusinow et al. | 24
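
The token requirements listed in the README changes for `build_all.py` could be checked before a long build starts. The following is a minimal sketch only: `missing_tokens` is a hypothetical helper that does not exist in `build_all.py`, and tying `GITHUB_TOKEN` to the presence of `--github-username` is an assumption rather than documented behavior.

```python
import os

# Hypothetical helper, not part of build_all.py. It maps the flags and
# datasets named in the README to the environment variables they need.
def missing_tokens(flags, datasets, env=None):
    env = os.environ if env is None else env
    required = []
    # Per the README, the beataml and mpnst datasets need Synapse access.
    if {"beataml", "mpnst"} & set(datasets):
        required.append("SYNAPSE_AUTH_TOKEN")
    if "pypi" in flags:
        required.append("PYPI_TOKEN")
    if "figshare" in flags:
        required.append("FIGSHARE_TOKEN")
    # Assumption: a GitHub tag push is requested by passing --github-username.
    if "github-username" in flags:
        required.append("GITHUB_TOKEN")
    # Report only the variables that are unset or empty.
    return [name for name in required if not env.get(name)]
```

Calling it with the parsed flags and the dataset list before any container is built would let the run stop early with a list of unset variables.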


52 changes: 31 additions & 21 deletions build/beatAML/GetBeatAML.py
@@ -134,8 +134,11 @@ def generate_samples_file(prev_samples_path):
     prot_samples["other_id_source"] = "beatAML"

     all_samples = pd.concat([prot_samples, full_samples])
-    all_samples['species'] = 'Homo sapiens'
-    maxval = max(pd.read_csv(prev_samples_path).improve_sample_id)
+    all_samples['species'] = 'Homo sapiens (Human)'
+    if prev_samples_path == "":
+        maxval = 0
+    else:
+        maxval = max(pd.read_csv(prev_samples_path).improve_sample_id)
     mapping = {labId: i for i, labId in enumerate(all_samples['other_id'].unique(), start=(int(maxval)+1))}
     all_samples['improve_sample_id'] = all_samples['other_id'].map(mapping)
     all_samples.insert(1, 'improve_sample_id', all_samples.pop('improve_sample_id'))
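
The ID numbering in this hunk can be illustrated without pandas. This is a simplified sketch, not the script's code: `assign_sample_ids` is a hypothetical name, and a plain list stands in for the `other_id` column.

```python
def assign_sample_ids(other_ids, prev_max=0):
    """Map each distinct ID to a sequential integer starting at prev_max + 1."""
    # dict.fromkeys de-duplicates while keeping first-seen order,
    # which mirrors the order returned by pandas' Series.unique().
    unique_ids = list(dict.fromkeys(other_ids))
    return {oid: i for i, oid in enumerate(unique_ids, start=int(prev_max) + 1)}
```

With no previous samples file, `prev_max` stays at 0 and numbering starts at 1, which is why the resulting IDs will not align with other datasets.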
@@ -282,8 +285,14 @@ def format_drug_map(drug_map_path):
     pd.DataFrame
         Formatted and cleaned drug mapping dataframe.
     """
-    drug_map = pd.read_csv(drug_map_path, sep = "\t")
-    drug_map = drug_map.drop_duplicates(subset='isoSMILES', keep='first')
+    if drug_map_path:
+        drug_map = pd.read_csv(drug_map_path, sep = "\t")
+        drug_map = drug_map.drop_duplicates(subset='isoSMILES', keep='first')
+    else:
+        drug_map = pd.DataFrame(columns=[
+            'improve_drug_id', 'chem_name', 'pubchem_id', 'canSMILES',
+            'isoSMILES', 'InChIKey', 'formula', 'weight'
+        ])
     return drug_map

 #Drug Response
@@ -326,7 +335,12 @@ def add_improve_id(previous_df, new_df):
     pd.DataFrame
         New dataframe with 'improve_drug_id' added.
     """
-    max_id = max([int(val.replace('SMI_', '')) for val in previous_df['improve_drug_id'].tolist() if pd.notnull(val) and val.startswith('SMI_')])
+    if not previous_df.empty and 'improve_drug_id' in previous_df.columns:
+        id_list = [int(val.replace('SMI_', '')) for val in previous_df['improve_drug_id'].tolist() if pd.notnull(val) and val.startswith('SMI_')]
+        max_id = max(id_list) if id_list else 0  # Default to 0 if the list is empty
+    else:
+        max_id = 0  # Default value if the DataFrame is empty or doesn't have the column
+    # max_id = max([int(val.replace('SMI_', '')) for val in previous_df['improve_drug_id'].tolist() if pd.notnull(val) and val.startswith('SMI_')])
     # Identify isoSMILES in the new dataframe that don't exist in the old dataframe
     unique_new_smiles = set(new_df['isoSMILES']) - set(previous_df['isoSMILES'])
     # Identify rows in the new dataframe with isoSMILES that are unique and where improve_drug_id is NaN
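
The fallback added in this hunk guards against calling `max()` on an empty list, which raises `ValueError`. A minimal standalone sketch of the same idea follows; `max_smi_id` is a hypothetical name, and an `isinstance` check stands in for the `pd.notnull` test so the example needs no pandas.

```python
def max_smi_id(ids):
    """Return the largest numeric suffix among 'SMI_<n>' IDs, or 0 if none exist."""
    numbers = [
        int(value.replace("SMI_", ""))
        for value in ids
        # Skip missing values (None, NaN) and IDs with a different prefix.
        if isinstance(value, str) and value.startswith("SMI_")
    ]
    # default=0 covers the empty case that made the original max() call fail.
    return max(numbers, default=0)
```

Using `max(..., default=0)` expresses the empty-list fallback in one call, as an alternative to the conditional expression used in the commit.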
@@ -552,10 +566,10 @@ def generate_drug_list(drug_map_path,drug_path):
     ##the next three arguments determine what we'll do

     parser.add_argument('-s', '--samples', action = 'store_true', help='Only generate samples, requires previous samples',default=False)
-    parser.add_argument('-p', '--prevSamples', type=str, help='Use this to provide previous sample file, will run sample file generation',default='')
+    parser.add_argument('-p', '--prevSamples', nargs='?',type=str, default='', const='', help='Use this to provide previous sample file, will run sample file generation')

     parser.add_argument('-d', '--drugs',action='store_true', default=False,help='Query drugs only, requires drug file')
-    parser.add_argument('-r', '--drugFile',type=str,help='Path to existing drugs.tsv file to query')
+    parser.add_argument('-r', '--drugFile',nargs='?',type=str, default='', const='',help='Path to existing drugs.tsv file to query')

     parser.add_argument('-o', '--omics',action='store_true',default=False,help='Set this flag to query omics, requires current samples')
     parser.add_argument('-c', '--curSamples', type=str, help='Add path if you want to generate data')
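
The effect of adding `nargs='?'` with `const=''` can be seen in isolation. The sketch below rebuilds only the `-p` option from this hunk in a throwaway parser; the file name `prev.csv` is a made-up value for illustration.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-p', '--prevSamples', nargs='?', type=str, default='', const='')

# Flag omitted entirely: argparse falls back to `default`.
absent = parser.parse_args([]).prevSamples
# Flag given with no value: argparse uses `const` instead of raising an error.
bare = parser.parse_args(['-p']).prevSamples
# Flag given with a value: the value is used as-is.
given = parser.parse_args(['-p', 'prev.csv']).prevSamples
```

Without `nargs='?'`, a bare `-p` makes argparse exit with an "expected one argument" error, so the new definition lets a caller pass the flag with no path.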
@@ -604,27 +618,23 @@ def generate_drug_list(drug_map_path,drug_path):
     supplimentary_file = '1-s2.0-S1535610822003129-mmc2.xlsx'
     download_from_github(supplementary_url, supplimentary_file)

-    #prev_samples_path = "hcmi_samples.csv"
-    #improve_map_file = "/tmp/beataml_samples.csv"

     if args.samples:
         if args.prevSamples is None or args.prevSamples=='':
-            print("Cannot run sample file generation without previous samples")
-            exit()
+            print("No Previous Samples file was found. Data will not align with other datasets. Use ONLY for testing purposes.")
         else:
-            print("Only running Samples File Generation")
-            prev_samples_path = args.prevSamples
-            #Generate Samples File
-            generate_samples_file(prev_samples_path)
+            print("Previous Samples File Provided. Running BeatAML Sample File Generation")
+        #Generate Samples File
+        generate_samples_file(args.prevSamples)
     if args.drugs:
         if args.drugFile is None or args.drugFile=='':
-            print("Cannot run drug matching without prior drug file")
-            exit()
+            print("Prior Drug File not provided. Data will not align with other datasets. Use ONLY for testing purposes.")
         else:
-            original_drug_file = "beataml_wv1to4_raw_inhibitor_v4_dbgap.txt"
-            original_drug_url = "https://github.com/biodev/beataml2.0_data/raw/main/beataml_wv1to4_raw_inhibitor_v4_dbgap.txt"
-            download_from_github(original_drug_url, original_drug_file)
-            generate_drug_list(args.drugFile, original_drug_file) ##this doesn't exist, need to add
+            print("Drug File Provided. Proceeding with build.")
+        original_drug_file = "beataml_wv1to4_raw_inhibitor_v4_dbgap.txt"
+        original_drug_url = "https://github.com/biodev/beataml2.0_data/raw/main/beataml_wv1to4_raw_inhibitor_v4_dbgap.txt"
+        download_from_github(original_drug_url, original_drug_file)
+        generate_drug_list(args.drugFile, original_drug_file) ##this doesn't exist, need to add
     if args.omics:
         if args.genes is None or args.curSamples is None:
             print('Cannot process omics without sample mapping and gene mapping files')
2 changes: 1 addition & 1 deletion build/broad_sanger/04b-nci60-updated.py
@@ -74,7 +74,7 @@ def main():

     newnames = pl.DataFrame(
         {
-            'new_name':[re.split(' |\(|\/',a)[0] for a in nulls['CELL_NAME']],
+            'new_name': [re.split(r' |\(|/', a)[0] for a in nulls['CELL_NAME']],
             'CELL_NAME':nulls['CELL_NAME']
         }
     )
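
The one-line change above switches the pattern to a raw string. In a normal string literal, `\(` and `\/` are invalid escape sequences that newer Python versions warn about; a raw string passes the backslash through to the regex engine, and `/` needs no escaping at all. A small sketch of what the pattern does, using a hypothetical helper `first_token` and made-up cell names:

```python
import re

def first_token(cell_name):
    """Return the part of the name before the first space, '(' or '/'."""
    # Raw string: the backslash reaches the regex engine unchanged,
    # so '\(' matches a literal opening parenthesis.
    return re.split(r' |\(|/', cell_name)[0]
```

Names without any of the three separators are returned whole, since `re.split` yields a single-element list in that case.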