add SIDER #438
Merged
Commits
13 commits
b255e6f  feat: add SIDER  (AdrianM0)
c13f15c  fix line endings  (AdrianM0)
7a05881  linting  (AdrianM0)
e929eb5  Update transform.py  (AdrianM0)
36f44ff  Update data/tabular/SIDER/transform.py  (kjappelbaum)
454928b  update script and meta
d0334b8  remove the ones that do not sound like disorders  (AdrianM0)
e039c05  fix: bug and add nouns  (AdrianM0)
2ceb67a  fix: remove raw data saver  (AdrianM0)
19b1942  fix: minor details  (AdrianM0)
72b49e6  chore: update dataframe columns  (AdrianM0)
c1d5d2d  chore: remove column  (AdrianM0)
f1d2270  fix: nr of classes  (AdrianM0)
@@ -0,0 +1,223 @@
---
name: SIDER
description: Database of marketed drugs and adverse drug reactions (ADR), grouped into 24 system organ classes.
identifiers:
  - id: SMILES
    type: SMILES
    description: SMILES
targets:
  - id: hepatobiliary_disorders
    description: hepatobiliary disorders
    type: boolean
    names:
      - noun: hepatobiliary disorders
      - noun: liver and gallbladder disorders
  - id: metabolism_and_nutrition_disorders
    description: metabolism and nutrition disorders
    type: boolean
    names:
      - noun: metabolism and nutrition disorders
      - noun: metabolic and nutritional disorders
  - id: eye_disorders
    description: eye disorders
    type: boolean
    names:
      - noun: eye disorders
      - noun: ophthalmic disorders
  - id: musculoskeletal_and_connective_tissue_disorders
    description: musculoskeletal and connective tissue disorders
    type: boolean
    names:
      - noun: musculoskeletal and connective tissue disorders
      - noun: muscle and joint disorders
  - id: gastrointestinal_disorders
    description: gastrointestinal disorders
    type: boolean
    names:
      - noun: gastrointestinal disorders
      - noun: digestive system disorders
  - id: immune_system_disorders
    description: immune system disorders
    type: boolean
    names:
      - noun: immune system disorders
      - noun: disorders of the immune system
  - id: reproductive_system_and_breast_disorders
    description: reproductive system and breast disorders
    type: boolean
    names:
      - noun: reproductive system and breast disorders
      - noun: disorders of the breasts and the reproductive system
  - id: neoplasms_benign_malignant_and_unspecified_(incl_cysts_and_polyps)
    description: neoplasms benign, malignant and unspecified (incl cysts and polyps)
    type: boolean
    names:
      - noun: neoplasms benign, malignant and unspecified (incl cysts and polyps)
      - noun: benign and malignant tumors (including cysts and polyps)
  - id: general_disorders_and_administration_site_conditions
    description: general disorders and administration site conditions
    type: boolean
    names:
      - noun: general disorders and administration site conditions
      - noun: general health and administration site conditions
  - id: endocrine_disorders
    description: endocrine disorders
    type: boolean
    names:
      - noun: endocrine disorders
      - noun: endocrine system disorders
  - id: surgical_and_medical_procedures
    description: surgical and medical procedures
    type: boolean
    names:
      - noun: surgical and medical procedures
      - noun: medical and surgical procedures
  - id: vascular_disorders
    description: vascular disorders
    type: boolean
    names:
      - noun: vascular disorders
      - noun: vascular system disorders
  - id: blood_and_lymphatic_system_disorders
    description: blood and lymphatic system disorders
    type: boolean
    names:
      - noun: blood and lymphatic system disorders
      - noun: disorders of the blood and lymphatic system
  - id: skin_and_subcutaneous_tissue_disorders
    description: skin and subcutaneous tissue disorders
    type: boolean
    names:
      - noun: skin and subcutaneous tissue disorders
      - noun: disorders of the skin and subcutaneous tissue
  - id: congenital_familial_and_genetic_disorders
    description: congenital, familial and genetic disorders
    type: boolean
    names:
      - noun: congenital, familial and genetic disorders
      - noun: familial, congenital and genetic disorders
  - id: infections_and_infestations
    description: infections and infestations
    type: boolean
    names:
      - noun: infections and infestations
      - noun: infestations and infections
  - id: respiratory_thoracic_and_mediastinal_disorders
    description: respiratory, thoracic and mediastinal disorders
    type: boolean
    names:
      - noun: respiratory, thoracic and mediastinal disorders
      - noun: respiratory and thoracic disorders
  - id: psychiatric_disorders
    description: psychiatric disorders
    type: boolean
    names:
      - noun: psychiatric disorders
      - noun: mental health and psychiatric disorders
  - id: renal_and_urinary_disorders
    description: renal and urinary disorders
    type: boolean
    names:
      - noun: renal and urinary disorders
      - noun: kidney and urinary tract disorders
  - id: pregnancy_puerperium_and_perinatal_conditions
    description: pregnancy, puerperium and perinatal conditions
    type: boolean
    names:
      - noun: pregnancy, puerperium and perinatal conditions
      - noun: pregnancy, childbirth, and newborn conditions
  - id: ear_and_labyrinth_disorders
    description: ear and labyrinth disorders
    type: boolean
    names:
      - noun: ear and labyrinth disorders
      - noun: ear and inner ear disorders
  - id: cardiac_disorders
    description: cardiac disorders
    type: boolean
    names:
      - noun: cardiac disorders
      - noun: cardiovascular disorders
  - id: nervous_system_disorders
    description: nervous system disorders
    type: boolean
    names:
      - noun: nervous system disorders
      - noun: disorders of the nervous system
  - id: injury_poisoning_and_procedural_complications
    description: injury, poisoning and procedural complications
    type: boolean
    names:
      - noun: injury, poisoning and procedural complications
      - noun: injuries and poisonings
license: CC BY 4.0
links:
  - url: https://academic.oup.com/nar/article/44/D1/D1075/2502602?login=false
    description: corresponding publication
  - url: https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/sider.csv.gz
    description: Data source
num_points: 1427
bibtex:
  - |-
    @article{10.1093/nar/gkv1075,
    author = {Kuhn, Michael and Letunic, Ivica and Jensen, Lars Juhl and Bork, Peer},
    title = "{The SIDER database of drugs and side effects}",
    journal = {Nucleic Acids Research},
    volume = {44},
    number = {D1},
    pages = {D1075-D1079},
    year = {2015},
    month = {10},
    issn = {0305-1048},
    doi = {10.1093/nar/gkv1075},
    url = {https://doi.org/10.1093/nar/gkv1075},
    }
templates:
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {hepatobiliary_disorders#not
    a &a }{#potential cause|potential reason!} for {hepatobiliary_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {metabolism_and_nutrition_disorders#not
    a &a }{#potential cause|potential reason!} for {metabolism_and_nutrition_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {eye_disorders#not
    a &a }{#potential cause|potential reason!} for {eye_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {musculoskeletal_and_connective_tissue_disorders#not
    a &a }{#potential cause|potential reason!} for {musculoskeletal_and_connective_tissue_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {gastrointestinal_disorders#not
    a &a }{#potential cause|potential reason!} for {gastrointestinal_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {immune_system_disorders#not
    a &a }{#potential cause|potential reason!} for {immune_system_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {reproductive_system_and_breast_disorders#not
    a &a }{#potential cause|potential reason!} for {reproductive_system_and_breast_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {neoplasms_benign_malignant_and_unspecified_(incl_cysts_and_polyps)#not
    a &a }{#potential cause|potential reason!} for {neoplasms_benign_malignant_and_unspecified_(incl_cysts_and_polyps)__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {general_disorders_and_administration_site_conditions#not
    a &a }{#potential cause|potential reason!} for {general_disorders_and_administration_site_conditions__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {endocrine_disorders#not
    a &a }{#potential cause|potential reason!} for {endocrine_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {surgical_and_medical_procedures#not
    a &a }{#potential cause|potential reason!} for {surgical_and_medical_procedures__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {vascular_disorders#not
    a &a }{#potential cause|potential reason!} for {vascular_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {blood_and_lymphatic_system_disorders#not
    a &a }{#potential cause|potential reason!} for {blood_and_lymphatic_system_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {skin_and_subcutaneous_tissue_disorders#not
    a &a }{#potential cause|potential reason!} for {skin_and_subcutaneous_tissue_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {congenital_familial_and_genetic_disorders#not
    a &a }{#potential cause|potential reason!} for {congenital_familial_and_genetic_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {infections_and_infestations#not
    a &a }{#potential cause|potential reason!} for {infections_and_infestations__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {respiratory_thoracic_and_mediastinal_disorders#not
    a &a }{#potential cause|potential reason!} for {respiratory_thoracic_and_mediastinal_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {psychiatric_disorders#not
    a &a }{#potential cause|potential reason!} for {psychiatric_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {renal_and_urinary_disorders#not
    a &a }{#potential cause|potential reason!} for {renal_and_urinary_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {pregnancy_puerperium_and_perinatal_conditions#not
    a &a }{#potential cause|potential reason!} for {pregnancy_puerperium_and_perinatal_conditions__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {ear_and_labyrinth_disorders#not
    a &a }{#potential cause|potential reason!} for {ear_and_labyrinth_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {cardiac_disorders#not
    a &a }{#potential cause|potential reason!} for {cardiac_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {nervous_system_disorders#not
    a &a }{#potential cause|potential reason!} for {nervous_system_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {injury_poisoning_and_procedural_complications#not
    a &a }{#potential cause|potential reason!} for {injury_poisoning_and_procedural_complications__names__noun}.
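To make the template placeholders above easier to read, here is a minimal sketch of how one such template might be filled for a single row. It is not the project's actual sampler: the `render` helper, the shape of the `meta` dictionary, and the example SMILES `CCO` are all hypothetical. It also assumes that in `{target#not a &a }` the text before `&` is used for label 0 and the text after it for label 1, that `{#a|b!}` picks one alternative at random, and that `{SMILES#}` inserts the raw value.

```python
import random
import re


def render(template: str, row: dict, meta: dict, rng: random.Random) -> str:
    """Fill one template for one row (hypothetical helper, for illustration only)."""

    def fill(match: re.Match) -> str:
        field = match.group(1)
        # {#a|b|c!}: choose one of the alternatives at random
        if field.startswith("#") and field.endswith("!"):
            return rng.choice(field[1:-1].split("|"))
        # {col#}: insert the raw value of the column
        if field.endswith("#"):
            return str(row[field[:-1]])
        # {col#text_for_0&text_for_1}: choose by the boolean label (assumed order)
        if "#" in field:
            col, options = field.split("#", 1)
            negative, positive = options.split("&", 1)
            return positive if row[col] else negative
        # {col__description}: look up the description in the metadata
        if field.endswith("__description"):
            return meta[field[: -len("__description")]]["description"]
        # {col__names__noun}: pick one of the nouns listed for the target
        if field.endswith("__names__noun"):
            return rng.choice(meta[field[: -len("__names__noun")]]["nouns"])
        raise ValueError(f"unknown placeholder: {field!r}")

    return re.sub(r"\{([^{}]*)\}", fill, template)


# Shortened version of the eye_disorders template from the metadata above.
TEMPLATE = (
    "The {#molecule|compound!} with the {SMILES__description} "
    "{#representation of |!}{SMILES#} is {eye_disorders#not a &a }"
    "{#potential cause|potential reason!} for {eye_disorders__names__noun}."
)
META = {
    "SMILES": {"description": "SMILES"},
    "eye_disorders": {"nouns": ["eye disorders", "ophthalmic disorders"]},
}

rng = random.Random(0)
# "CCO" is a made-up example input, not a row from the dataset.
positive = render(TEMPLATE, {"SMILES": "CCO", "eye_disorders": 1}, META, rng)
negative = render(TEMPLATE, {"SMILES": "CCO", "eye_disorders": 0}, META, rng)
print(positive)
print(negative)
```

Under these assumptions a label of 1 yields a sentence of the form "... is a potential cause for ..." and a label of 0 yields "... is not a potential cause for ...".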
@@ -0,0 +1,148 @@
from typing import List, Tuple

import pandas as pd
import yaml

ALT_DESCRIPTIONS = [
    "Liver and Gallbladder Disorders",
    "Metabolic and Nutritional Disorders",
    "Ophthalmic Disorders",
    "Muscle and Joint Disorders",
    "Digestive System Disorders",
    "Disorders of the Immune System",
    "Disorders of the breasts and the Reproductive system",
    "Benign and Malignant Tumors (including Cysts and Polyps)",
    "General Health and Administration Site Conditions",
    "Endocrine System Disorders",
    "Medical and Surgical Procedures",
    "Vascular System Disorders",
    "Disorders of the blood and lymphatic system",
    "Disorders of the Skin and Subcutaneous Tissue",
    "Familial, Congenital and Genetic Disorders",
    "Infestations and Infections",
    "Respiratory and Thoracic Disorders",
    "Mental Health and Psychiatric Disorders",
    "Kidney and Urinary Tract Disorders",
    "Pregnancy, Childbirth, and Newborn Conditions",
    "Ear and Inner Ear Disorders",
    "Cardiovascular Disorders",
    "Disorders of the Nervous System",
    "Injuries and Poisonings",
]


def load_dataset() -> pd.DataFrame:
    sider = pd.read_csv(
        "https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/sider.csv.gz"
    )
    # Drop the classes that do not sound like disorders.
    sider = sider.drop(
        columns=["Product issues", "Social circumstances", "Investigations"]
    )
    return sider


def transform_data() -> Tuple[pd.DataFrame, pd.Index]:
    sider = load_dataset()
    # Keep the lower-cased original names; they become the target descriptions.
    old_columns = sider.columns.str.lower()
    # Derive column ids: lower case, spaces to underscores, commas removed.
    sider.columns = sider.columns.str.lower().str.replace(" ", "_").str.replace(",", "")
    sider = sider.rename(columns={"smiles": "SMILES"})
    sider.to_csv("data_clean.csv", index=False)

    return sider, old_columns


def write_meta(column_ids: pd.Index, descriptions: List[str], num_points: int) -> None:
    # Write metadata
    # Index 0 is the SMILES column, so it is skipped here. zip() stops at the
    # shortest input, so ALT_DESCRIPTIONS must follow the column order.
    targets = [
        {
            "id": f"{col_id}",
            "description": f"{description}",
            "type": "boolean",
            "names": [
                {"noun": f"{description}".lower()},
                {"noun": f"{alt_desc}".lower()},
            ],
        }
        for col_id, description, alt_desc in zip(
            column_ids[1:], descriptions[1:], ALT_DESCRIPTIONS
        )
    ]

    # One text template per target column.
    templates = [
        "The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}"  # noqa: E501
        + " {#representation of |!}{SMILES#} is {"
        + col_id
        + "#not a &a }"
        + "{#potential cause|potential reason!} for {"
        + col_id
        + "__names__noun}."  # noqa: E501
        for col_id in column_ids[1:]
    ]

    meta = {
        "name": "SIDER",  # unique identifier, we will also use this for directory names
        "description": f"""Database of marketed drugs and adverse drug reactions (ADR), grouped into {len(column_ids[1:])} system organ classes.""",  # noqa: E501
        "identifiers": [
            {
                "id": "SMILES",  # column name
                "type": "SMILES",
                "description": "SMILES",  # description (optional, except for "Other")
            }
        ],
        "targets": targets,
        "license": "CC BY 4.0",  # license under which the original dataset was published
        "links": [  # list of relevant links (original dataset, other uses, etc.)
            {
                "url": "https://academic.oup.com/nar/article/44/D1/D1075/2502602?login=false",
                "description": "corresponding publication",
            },
            {
                "url": "https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/sider.csv.gz",
                "description": "Data source",
            },
        ],
        "num_points": num_points,  # number of datapoints in this dataset
        "bibtex": [
            """@article{10.1093/nar/gkv1075,
author = {Kuhn, Michael and Letunic, Ivica and Jensen, Lars Juhl and Bork, Peer},
title = "{The SIDER database of drugs and side effects}",
journal = {Nucleic Acids Research},
volume = {44},
number = {D1},
pages = {D1075-D1079},
year = {2015},
month = {10},
issn = {0305-1048},
doi = {10.1093/nar/gkv1075},
url = {https://doi.org/10.1093/nar/gkv1075},
}""",
        ],
        "templates": templates,
    }

    def str_presenter(dumper, data):
        """configures yaml for dumping multiline strings
        Ref: https://stackoverflow.com/questions/8640959/how-can-i-control-what-scalar-form-pyyaml-uses-for-my-data
        """
        if data.count("\n") > 0:  # check for multiline string
            return dumper.represent_scalar("tag:yaml.org,2002:str", data, style="|")
        return dumper.represent_scalar("tag:yaml.org,2002:str", data)

    yaml.add_representer(str, str_presenter)
    yaml.representer.SafeRepresenter.add_representer(
        str, str_presenter
    )  # to use with safe_dump
    fn_meta = "meta.yaml"
    with open(fn_meta, "w") as f:
        yaml.dump(meta, f, sort_keys=False)

    print(f"Finished processing {meta['name']} dataset!")


def main():
    sider, old_columns = transform_data()
    write_meta(sider.columns, old_columns, len(sider))


if __name__ == "__main__":
    main()
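The script pairs each target column with an entry of ALT_DESCRIPTIONS purely by position, and zip() silently truncates if the two lists differ in length. The sketch below replays the column renaming on plain strings (no pandas) and adds an explicit length check. The `normalize` and `build_targets` helpers are hypothetical names introduced for illustration, and the capitalization of the two raw column names is an assumption.

```python
from typing import Dict, List


def normalize(name: str) -> str:
    """Mirror the renaming in transform_data on a plain string."""
    return name.lower().replace(" ", "_").replace(",", "")


def build_targets(raw_columns: List[str], alt_descriptions: List[str]) -> List[Dict]:
    """Pair raw column names with alternative nouns by position."""
    # Fail loudly instead of letting zip() drop unmatched entries.
    if len(raw_columns) != len(alt_descriptions):
        raise ValueError(
            f"{len(raw_columns)} target columns but "
            f"{len(alt_descriptions)} alternative descriptions"
        )
    return [
        {
            "id": normalize(raw),
            "description": raw.lower(),
            "type": "boolean",
            "names": [{"noun": raw.lower()}, {"noun": alt.lower()}],
        }
        for raw, alt in zip(raw_columns, alt_descriptions)
    ]


# Two illustrative columns; the real dataset keeps 24 of them.
raw = [
    "Hepatobiliary disorders",
    "Respiratory, thoracic and mediastinal disorders",
]
alts = [
    "Liver and Gallbladder Disorders",
    "Respiratory and Thoracic Disorders",
]

targets = build_targets(raw, alts)
for target in targets:
    print(target["id"], "->", [n["noun"] for n in target["names"]])
```

A check of this kind would catch the case where a column is dropped in load_dataset without removing the matching entry from ALT_DESCRIPTIONS. It cannot detect two entries being in the wrong order.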
I'm still not sure about this. You put in SMILES and then predict something about "procedural complications", which sounds more like surgery.
@AdrianM0, what are your thoughts on this one?