Cell Line Datasets Separated, Improved Error Handling for all Build Scripts #237

Merged
33 commits, merged Nov 12, 2024
Commits
a7d0c8f
Introduced splitting script. Updated build scripts. Added all new sch…
jjacobson95 Oct 18, 2024
260dacf
multiple schema check fixes
jjacobson95 Oct 18, 2024
3e5a564
Working on testing separating broad_sanger
jjacobson95 Oct 21, 2024
2dc238c
Working on testing separating broad_sanger
jjacobson95 Oct 21, 2024
c661f7a
Error traps and exit codes added. This should allow errors to stop co…
jjacobson95 Oct 21, 2024
1eee16f
switching from sh to bash to allow for error propagation
jjacobson95 Oct 22, 2024
649a05e
Merge remote-tracking branch 'origin/main' into split_broad_sanger
jjacobson95 Oct 22, 2024
c18a979
pinning polars in broad_sanger
Oct 22, 2024
39c0c6c
nci60 concat fix
jjacobson95 Oct 23, 2024
d97f418
bug fix
jjacobson95 Oct 23, 2024
c810335
nci fix
jjacobson95 Oct 24, 2024
9ac6b0b
fix
jjacobson95 Oct 25, 2024
fd54b97
tiny bug fix
jjacobson95 Oct 28, 2024
9f7520d
tiny bug fix 2
jjacobson95 Oct 28, 2024
419e389
Tracked down issue causing beataml failure. ensembl moved/removed a file
jjacobson95 Oct 28, 2024
a38702c
working on mpnst fix. will add issue if this doesn't resolve it
jjacobson95 Oct 30, 2024
b443d16
BeatAML schema fix
jjacobson95 Oct 30, 2024
895a1aa
BeatAML schema fix
jjacobson95 Oct 30, 2024
bcb6bb4
Removed old code from beataml
jjacobson95 Oct 30, 2024
b7c39b4
allowed hill slope to be positive
sgosline Nov 1, 2024
d0a8762
Testing New Schema Checker & Build Process
jjacobson95 Nov 1, 2024
208e0ad
updated dose_response_value to any in schema
jjacobson95 Nov 1, 2024
3fd4b95
Merge remote-tracking branch 'origin/curve-fit-change' into split_bro…
jjacobson95 Nov 5, 2024
9ace774
working on 05_separate_datasets.py bug
Nov 5, 2024
d1731c1
Merge remote-tracking branch 'refs/remotes/origin/split_broad_sanger'…
Nov 5, 2024
c28bd78
Moving some changes from local to test on AWS. added memory debug
jjacobson95 Nov 5, 2024
ace675d
Still working on memory issues with broad_Sanger splitting
jjacobson95 Nov 5, 2024
9614770
Reverting version. Updated docker settings
jjacobson95 Nov 5, 2024
ec06b7d
Full rework of broad_sanger splitter. Now uses polars and optimized f…
jjacobson95 Nov 6, 2024
640f949
Schema update
jjacobson95 Nov 11, 2024
8efb440
Final update
Nov 11, 2024
20b3d36
last update
Nov 11, 2024
85ed415
Removed old schema checking code.
jjacobson95 Nov 11, 2024
3 changes: 1 addition & 2 deletions .dockerignore
@@ -4,5 +4,4 @@ coderdata/
dataSummary/
docs/
candle_bmd/
schema/
build/local/
build/local/
53 changes: 18 additions & 35 deletions build/beatAML/GetBeatAML.py
@@ -174,10 +174,8 @@ def retrieve_drug_info(compound_name):
return np.nan, np.nan, np.nan, np.nan, np.nan, np.nan

data = response.json()
#print(data)
if "PropertyTable" in data:
properties = data["PropertyTable"]["Properties"][0]
#print(properties)
pubchem_id = properties.get('CID',np.nan)
canSMILES = properties.get("CanonicalSMILES", np.nan)
isoSMILES = properties.get("IsomericSMILES", np.nan)
@@ -259,9 +257,6 @@ def merge_drug_info(d_df,drug_map):
pd.DataFrame
The merged dataframe containing combined drug information.
"""
#print(drug_map)
#print(d_df.columns)
#print(d_df)
print(d_df['isoSMILES'].dtype, drug_map['isoSMILES'].dtype)
d_df['isoSMILES'] = d_df['isoSMILES'].astype(str)
drug_map['isoSMILES'] = drug_map['isoSMILES'].astype(str)
@@ -337,10 +332,9 @@ def add_improve_id(previous_df, new_df):
"""
if not previous_df.empty and 'improve_drug_id' in previous_df.columns:
id_list = [int(val.replace('SMI_', '')) for val in previous_df['improve_drug_id'].tolist() if pd.notnull(val) and val.startswith('SMI_')]
max_id = max(id_list) if id_list else 0 # Default to 0 if the list is empty
max_id = max(id_list) if id_list else 0
else:
max_id = 0 # Default value if the DataFrame is empty or doesn't have the column
# max_id = max([int(val.replace('SMI_', '')) for val in previous_df['improve_drug_id'].tolist() if pd.notnull(val) and val.startswith('SMI_')])
max_id = 0
# Identify isoSMILES in the new dataframe that don't exist in the old dataframe
unique_new_smiles = set(new_df['isoSMILES']) - set(previous_df['isoSMILES'])
# Identify rows in the new dataframe with isoSMILES that are unique and where improve_drug_id is NaN
@@ -370,24 +364,9 @@ def map_exp_to_improve(exp_path):#df,improve_map_file):
pd.DataFrame
Mapped dataframe with 'improve_sample_id' added and 'sample_id' removed.
"""
mapped_df = pd.read_csv(exp_path,sep='\t') # Map sample_id to improve_sample_id
#mapped_df = pd.merge(df, improve[['other_id', 'improve_sample_id']], left_on='sample_id', right_on='other_id', how='left')
#mapped_df.drop(columns=['sample_id', 'other_id'], inplace=True)
#mapped_df.insert(0, 'improve_sample_id', mapped_df.pop('improve_sample_id'))
mapped_df = pd.read_csv(exp_path,sep='\t')
mapped_df['source'] = 'synapse'
mapped_df['study'] = 'BeatAML'
#mapped_df= mapped_df.rename(columns={'Drug':'improve_sample_id',
# 'IC50':'ic50',
# 'EC50':'ec50',
# 'EC50se':'ec50se',
# 'Einf':'einf',
# 'HS':'hs',
# 'AAC1':'aac1',
# 'AUC1':'auc1',
# 'DSS1':'dss1',
# 'R2fit':'r2fit'
# }
# )
return mapped_df


@@ -445,12 +424,21 @@ def map_and_combine(df, data_type, entrez_map_file, improve_map_file, map_file=N
mapped_df.rename(columns={"hgvsc": "mutation"}, inplace=True)
mapped_df.rename(columns={"labId": "sample_id"}, inplace=True)
mapped_df.rename(columns={"Entrez_Gene_Id": "entrez_id"}, inplace=True)

elif data_type == "mutation":
df = df[['dbgap_sample_id','hgvsc', 'hgvsp', 'gene', 'variant_classification','t_vaf', 'refseq', 'symbol']]
mapped_df = df.merge(genes, left_on='symbol', right_on='gene_symbol', how='left').reindex(
columns=['hgvsc', 'entrez_id', "dbgap_sample_id","variant_classification"])

variant_mapping = {
'frameshift_variant': 'Frameshift_Variant',
'missense_variant': 'Missense_Mutation',
'stop_gained': 'Nonsense_Mutation',
'inframe_deletion': 'In_Frame_Del',
'protein_altering_variant': 'Protein_Altering_Variant',
'splice_acceptor_variant': 'Splice_Site',
'splice_donor_variant': 'Splice_Site',
'start_lost': 'Start_Codon_Del',
'inframe_insertion': 'In_Frame_Ins',
'stop_lost': 'Nonstop_Mutation'
}

mapped_df['variant_classification'] = mapped_df['variant_classification'].map(variant_mapping)

elif data_type == "proteomics":
mapped_ids['sampleID'] = mapped_ids['sampleID'].str.split('_').apply(lambda x: x[2])
@@ -473,7 +461,6 @@ def map_and_combine(df, data_type, entrez_map_file, improve_map_file, map_file=N
inplace=True
)


mapped_df = pd.merge(mapped_df, improve[['other_id', 'improve_sample_id']],
left_on='sample_id',
right_on='other_id',
@@ -482,7 +469,7 @@ def map_and_combine(df, data_type, entrez_map_file, improve_map_file, map_file=N
mapped_df['source'] = 'synapse'
mapped_df['study'] = 'BeatAML'

final_dataframe = mapped_df.dropna()#pd.dropna(mapped_df,0)
final_dataframe = mapped_df.dropna()
return final_dataframe


@@ -659,8 +646,6 @@ def generate_drug_list(drug_map_path,drug_path):


t_df = pd.read_csv('tpm_'+transcriptomics_file, sep = '\t')
# t_df.index = t_df.stable_id#display_label
# t_df = t_df.iloc[:, 4:]
t_df = t_df.reset_index().rename(columns={'stable_id': 'Gene'})
t_df = pd.melt(t_df, id_vars=['Gene'], var_name='sample_id', value_name='transcriptomics')
print(improve_map_file)
@@ -724,7 +709,5 @@ def generate_drug_list(drug_map_path,drug_path):
exp_res = map_exp_to_improve(drug_path)
exp_res.to_csv("/tmp/beataml_experiments.tsv", index=False, sep='\t')

#drug_map_path = retrieve_figshare_data("https://figshare.com/ndownloader/files/43112314?private_link=0ea222d9bd461c756fb0")

# print("Finished Pipeline")

8 changes: 8 additions & 0 deletions build/beatAML/build_drugs.sh
@@ -1,2 +1,10 @@
#!/bin/bash
set -euo pipefail

trap 'echo "Error on or near line $LINENO while executing: $BASH_COMMAND"; exit 1' ERR

echo "Running GetBeatAML.py with token and drugFile $1"
python GetBeatAML.py --token $SYNAPSE_AUTH_TOKEN --drugs --drugFile $1

echo "Running build_drug_desc.py..."
python build_drug_desc.py --drugtable /tmp/beataml_drugs.tsv --desctable /tmp/beataml_drug_descriptors.tsv.gz
6 changes: 6 additions & 0 deletions build/beatAML/build_exp.sh
@@ -1 +1,7 @@
#!/bin/bash
set -euo pipefail

trap 'echo "Error on or near line $LINENO while executing: $BASH_COMMAND"; exit 1' ERR

echo "Running GetBeatAML.py with token and curSamples $1 and drugFile $2."
python GetBeatAML.py --exp --token $SYNAPSE_AUTH_TOKEN --curSamples $1 --drugFile $2
6 changes: 6 additions & 0 deletions build/beatAML/build_omics.sh
@@ -1 +1,7 @@
#!/bin/bash
set -euo pipefail

trap 'echo "Error on or near line $LINENO while executing: $BASH_COMMAND"; exit 1' ERR

echo "Running GetBeatAML.py with token, curSamples $2, and genes $1."
python GetBeatAML.py --token $SYNAPSE_AUTH_TOKEN --omics --curSamples $2 --genes $1
6 changes: 6 additions & 0 deletions build/beatAML/build_samples.sh
@@ -1 +1,7 @@
#!/bin/bash
set -euo pipefail

trap 'echo "Error on or near line $LINENO while executing: $BASH_COMMAND"; exit 1' ERR

echo "Running GetBeatAML.py with token and prevSamples $1."
python GetBeatAML.py --token $SYNAPSE_AUTH_TOKEN --samples --prevSamples $1
18 changes: 15 additions & 3 deletions build/broad_sanger/03a-nci60Drugs.py
@@ -122,9 +122,21 @@ def main():
merged = pl.concat([mdf,namedf],how='horizontal').select(['SMILES','pubchem_id','nscid','lower_name'])
melted = merged.melt(id_vars=['SMILES','pubchem_id'],value_vars=['nscid','lower_name']).select(['SMILES','pubchem_id','value']).unique()
melted.columns = ['canSMILES','pubchem_id','chem_name']
if newdf.shape[0]>0:
newdf = newdf.join(melted,on='canSMILES',how='inner').select(res.columns)
res = pl.concat([res,newdf],how='vertical')

if newdf.shape[0] > 0:
res = res.with_columns([
pl.col("InChIKey").cast(pl.Utf8),
pl.col("formula").cast(pl.Utf8),
pl.col("weight").cast(pl.Utf8)
])
newdf = newdf.with_columns([
pl.col("InChIKey").cast(pl.Utf8),
pl.col("formula").cast(pl.Utf8),
pl.col("weight").cast(pl.Utf8)
])

newdf = newdf.join(melted, on='canSMILES', how='inner').select(res.columns)
res = pl.concat([res, newdf], how='vertical')
res.write_csv(opts.output,separator='\t')

if __name__=='__main__':
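
The casts above address a schema mismatch: polars refuses a vertical concat when a column's dtype differs between the two frames, so InChIKey, formula, and weight are coerced to Utf8 on both sides before pl.concat(..., how='vertical'). A small illustrative sketch of the failure mode and the fix (made-up values, not data from the build):

import polars as pl

res = pl.DataFrame({"weight": [151.16]})        # parsed as Float64
newdf = pl.DataFrame({"weight": ["151.16"]})    # parsed as Utf8

# pl.concat([res, newdf], how="vertical")       # would raise: dtype mismatch

res = res.with_columns(pl.col("weight").cast(pl.Utf8))
combined = pl.concat([res, newdf], how="vertical")  # both columns Utf8 now, concat succeeds
print(combined)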
83 changes: 83 additions & 0 deletions build/broad_sanger/05_separate_datasets.py
@@ -0,0 +1,83 @@
import gc
import polars as pl



def main():

datasets_to_process = ["CCLE", "CTRPv2", "PRISM", "GDSCv1", "GDSCv2", "FIMM", "gCSI", "NCI60"]
omics_datatypes = ["transcriptomics","proteomics", "copy_number","mutations"] # csv
samples_datatypes = ["samples"] #csv

drugs_datatypes = ["drugs", "drug_descriptors"] # tsv


dataset_sources = {
"CCLE": ["Broad"],
"CTRPv2": ["Broad"],
"PRISM": ["Broad"],
"GDSCv1": ["Sanger"],
"GDSCv2": ["Sanger"],
"FIMM": ["Broad"],
"gCSI": ["Broad"], # gCSI generates its own omics data but it is comparable to CCLE. In future, retrive gCSI omics.
"NCI60": ["Broad"]
}

for dataset in datasets_to_process:
exp = pl.read_csv("broad_sanger_experiments.tsv", separator="\t") # Keeping memory down, so I will not be making copies.
exp = exp.filter(pl.col("study") == dataset)

# Extract information to separate out datasets
exp_improve_sample_ids = exp["improve_sample_id"].unique().to_list()
exp_improve_drug_ids = exp["improve_drug_id"].unique().to_list()

# Write Filtered Experiments File to TSV. Then delete it from memory.
exp_filename = f"/tmp/{dataset}_experiments.tsv".lower()
exp.write_csv(exp_filename, separator="\t")
del exp
gc.collect()


#Filter Samples files, write to file, delete from mem.
for samples in samples_datatypes:
samples_filename_in = f"broad_sanger_{samples}.csv"
samples_filename_out = f"/tmp/{dataset}_{samples}.csv".lower()
samples_df = pl.read_csv(samples_filename_in)
samples_df = samples_df.filter(pl.col("improve_sample_id").is_in(exp_improve_sample_ids))
samples_df.write_csv(samples_filename_out) #csv
del samples_df
gc.collect()

#One by one, filter other Omics files, write to file, delete from mem.
for omics in omics_datatypes:
omics_filename_in = f"broad_sanger_{omics}.csv"
omics_filename_out = f"/tmp/{dataset}_{omics}.csv".lower()
omics_df = pl.read_csv(omics_filename_in)
omics_df = omics_df.filter(pl.col("improve_sample_id").is_in(exp_improve_sample_ids))
omics_df = omics_df.filter(pl.col("source").is_in(dataset_sources[dataset]))
omics_df.write_csv(omics_filename_out) #csv
del omics_df
gc.collect()


#One by one, filter other Drugs files, write to file, delete from mem.
for drugs in drugs_datatypes:
drugs_filename_in = f"broad_sanger_{drugs}.tsv"
drugs_filename_out = f"/tmp/{dataset}_{drugs}.tsv".lower()
if drugs == "drug_descriptors":
drugs_df = pl.read_csv(drugs_filename_in,separator="\t",
dtypes={"improve_drug_id": pl.Utf8,
"structural_descriptor": pl.Utf8,
"descriptor_value": pl.Utf8}
)

else:
drugs_df = pl.read_csv(drugs_filename_in,separator="\t")

drugs_df = drugs_df.filter(pl.col("improve_drug_id").is_in(exp_improve_drug_ids))
drugs_df.write_csv(drugs_filename_out,separator="\t") #tsv
del drugs_df
gc.collect()

if __name__ == "__main__":
main()
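
The splitter above reads each broad_sanger_* table eagerly, filters it down to one study, writes the result to /tmp, and immediately frees the frame with del and gc.collect(), which matches the memory-focused rework described in the commit history. If peak memory ever becomes an issue again, one possible (untested here) variant is polars' lazy API, where the filter is pushed into the CSV scan so only matching rows are materialized; file and column names below follow the script, but this sketch is not part of the PR:

import polars as pl

exp_improve_sample_ids = [1, 2, 3]  # placeholder ids for illustration

filtered = (
    pl.scan_csv("broad_sanger_transcriptomics.csv")                        # lazy scan, nothing read yet
      .filter(pl.col("improve_sample_id").is_in(exp_improve_sample_ids))   # predicate applied during the scan
      .collect()                                                           # materialize only matching rows
)
filtered.write_csv("/tmp/ccle_transcriptomics.csv")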
16 changes: 14 additions & 2 deletions build/broad_sanger/build_drugs.sh
@@ -1,3 +1,15 @@
/opt/venv/bin/python 03a-nci60Drugs.py
#!/bin/bash
set -euo pipefail

trap 'echo "Error on or near line $LINENO while executing: $BASH_COMMAND"; exit 1' ERR

echo "Running 03a-nci60Drugs.py..."
/opt/venv/bin/python 03a-nci60Drugs.py

echo "Running 03-createDrugFile.R..."
Rscript 03-createDrugFile.R CTRPv2,GDSC,gCSI,PRISM,CCLE,FIMM
/opt/venv/bin/python build_drug_desc.py --drugtable /tmp/broad_sanger_drugs.tsv --desctable /tmp/broad_sanger_drug_descriptors.tsv.gz

echo "Running build_drug_desc.py..."
/opt/venv/bin/python build_drug_desc.py \
--drugtable /tmp/broad_sanger_drugs.tsv \
--desctable /tmp/broad_sanger_drug_descriptors.tsv.gz
8 changes: 7 additions & 1 deletion build/broad_sanger/build_exp.sh
@@ -1 +1,7 @@
/opt/venv/bin/python 04-drug_dosage_and_curves.py --drugfile $2 --curSampleFile $1
#!/bin/bash
set -euo pipefail

trap 'echo "Error on or near line $LINENO while executing: $BASH_COMMAND"; exit 1' ERR

echo "Running 04-drug_dosage_and_curves.py with drugfile $2 and curSampleFile $1"
/opt/venv/bin/python 04-drug_dosage_and_curves.py --drugfile $2 --curSampleFile $1
11 changes: 11 additions & 0 deletions build/broad_sanger/build_misc.sh
@@ -0,0 +1,11 @@
#!/bin/bash
set -euo pipefail

trap 'echo "Error on or near line $LINENO while executing: $BASH_COMMAND"; exit 1' ERR

cp /tmp/broad_sanger* .
echo "Running 05_separate_datasets.py..."
/opt/venv/bin/python 05_separate_datasets.py

echo "Removing broad_sanger* files..."
rm broad_sanger*
9 changes: 8 additions & 1 deletion build/broad_sanger/build_omics.sh
@@ -1,3 +1,10 @@
#!/bin/bash
set -euo pipefail

trap 'echo "Error on or near line $LINENO while executing: $BASH_COMMAND"; exit 1' ERR

echo "Running 02a-broad_sanger_proteomics.py with gene file $1 and sample file $2."
/opt/venv/bin/python 02a-broad_sanger_proteomics.py --gene $1 --sample $2

echo "Running 02-broadSangerOmics.R with gene file $1 and sample file $2,"
Rscript 02-broadSangerOmics.R $1 $2
#python 02a-broad/sanger_proteomics.py $1 $2
6 changes: 6 additions & 0 deletions build/broad_sanger/build_samples.sh
@@ -1 +1,7 @@
#!/bin/bash
set -euo pipefail

trap 'echo "Error on or near line $LINENO while executing: $BASH_COMMAND"; exit 1' ERR

echo "Running 01-broadSangerSamples.R."
Rscript 01-broadSangerSamples.R
4 changes: 3 additions & 1 deletion build/broad_sanger/requirements.txt
@@ -7,6 +7,8 @@ scikit-learn
scipy
requests
openpyxl
polars
polars==0.19.17
mordredcommunity
rdkit
coderdata==0.1.40
psutil