add SIDER #438
@@ -0,0 +1,220 @@
---
name: SIDER
description: Database of marketed drugs and adverse drug reactions (ADR), grouped into 27 system organ classes.
identifiers:
  - id: SMILES
    type: SMILES
    description: SMILES
targets:
  - id: hepatobiliary_disorders
    description: hepatobiliary disorders
    type: boolean
    names:
      - noun: hepatobiliary disorders
  - id: metabolism_and_nutrition_disorders
    description: metabolism and nutrition disorders
    type: boolean
    names:
      - noun: metabolism and nutrition disorders
  - id: product_issues
    description: product issues
    type: boolean
    names:
      - noun: product issues
  - id: eye_disorders
    description: eye disorders
    type: boolean
    names:
      - noun: eye disorders
  - id: investigations
    description: investigations
    type: boolean
    names:
      - noun: investigations
  - id: musculoskeletal_and_connective_tissue_disorders
    description: musculoskeletal and connective tissue disorders
    type: boolean
    names:
      - noun: musculoskeletal and connective tissue disorders
  - id: gastrointestinal_disorders
    description: gastrointestinal disorders
    type: boolean
    names:
      - noun: gastrointestinal disorders
  - id: social_circumstances

is this a disorder?
or is this some kind of disorder?
@kjappelbaum I can also remove these two?
also an option if you cannot figure out what those are about

    description: social circumstances
    type: boolean
    names:
      - noun: social circumstances
  - id: immune_system_disorders
    description: immune system disorders
    type: boolean
    names:
      - noun: immune system disorders
  - id: reproductive_system_and_breast_disorders
    description: reproductive system and breast disorders
    type: boolean
    names:
      - noun: reproductive system and breast disorders
  - id: neoplasms_benign_malignant_and_unspecified_(incl_cysts_and_polyps)
    description: neoplasms benign, malignant and unspecified (incl cysts and polyps)
    type: boolean
    names:
      - noun: neoplasms benign, malignant and unspecified (incl cysts and polyps)
  - id: general_disorders_and_administration_site_conditions
    description: general disorders and administration site conditions
    type: boolean
    names:
      - noun: general disorders and administration site conditions
  - id: endocrine_disorders
    description: endocrine disorders
    type: boolean
    names:
      - noun: endocrine disorders
  - id: surgical_and_medical_procedures
    description: surgical and medical procedures
    type: boolean
    names:
      - noun: surgical and medical procedures
  - id: vascular_disorders
    description: vascular disorders
    type: boolean
    names:
      - noun: vascular disorders
  - id: blood_and_lymphatic_system_disorders
    description: blood and lymphatic system disorders
    type: boolean
    names:
      - noun: blood and lymphatic system disorders
  - id: skin_and_subcutaneous_tissue_disorders
    description: skin and subcutaneous tissue disorders
    type: boolean
    names:
      - noun: skin and subcutaneous tissue disorders
  - id: congenital_familial_and_genetic_disorders
    description: congenital, familial and genetic disorders
    type: boolean
    names:
      - noun: congenital, familial and genetic disorders
  - id: infections_and_infestations
    description: infections and infestations
    type: boolean
    names:
      - noun: infections and infestations
  - id: respiratory_thoracic_and_mediastinal_disorders
    description: respiratory, thoracic and mediastinal disorders
    type: boolean
    names:
      - noun: respiratory, thoracic and mediastinal disorders
  - id: psychiatric_disorders
    description: psychiatric disorders
    type: boolean
    names:
      - noun: psychiatric disorders
  - id: renal_and_urinary_disorders
    description: renal and urinary disorders
    type: boolean
    names:
      - noun: renal and urinary disorders
  - id: pregnancy_puerperium_and_perinatal_conditions
    description: pregnancy, puerperium and perinatal conditions
    type: boolean
    names:
      - noun: pregnancy, puerperium and perinatal conditions
  - id: ear_and_labyrinth_disorders
    description: ear and labyrinth disorders
    type: boolean
    names:
      - noun: ear and labyrinth disorders
  - id: cardiac_disorders
    description: cardiac disorders
    type: boolean
    names:
      - noun: cardiac disorders
  - id: nervous_system_disorders
    description: nervous system disorders
    type: boolean
    names:
      - noun: nervous system disorders
  - id: injury_poisoning_and_procedural_complications
    description: injury, poisoning and procedural complications
    type: boolean
    names:
      - noun: injury, poisoning and procedural complications
license: CC BY 4.0
links:
  - url: https://academic.oup.com/nar/article/44/D1/D1075/2502602?login=false
    description: corresponding publication
  - url: https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/sider.csv.gz
    description: Data source
num_points: 1427
bibtex:
  - |-
    @article{10.1093/nar/gkv1075,
    author = {Kuhn, Michael and Letunic, Ivica and Jensen, Lars Juhl and Bork, Peer},
    title = "{The SIDER database of drugs and side effects}",
    journal = {Nucleic Acids Research},
    volume = {44},
    number = {D1},
    pages = {D1075-D1079},
    year = {2015},
    month = {10},
    issn = {0305-1048},
    doi = {10.1093/nar/gkv1075},
    url = {https://doi.org/10.1093/nar/gkv1075},
    }
templates:
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {hepatobiliary_disorders#not
    a &a }{#potential cause|potential reason!} for {hepatobiliary_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {metabolism_and_nutrition_disorders#not
    a &a }{#potential cause|potential reason!} for {metabolism_and_nutrition_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {product_issues#not
    a &a }{#potential cause|potential reason!} for {product_issues__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {eye_disorders#not
    a &a }{#potential cause|potential reason!} for {eye_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {investigations#not
    a &a }{#potential cause|potential reason!} for {investigations__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {musculoskeletal_and_connective_tissue_disorders#not
    a &a }{#potential cause|potential reason!} for {musculoskeletal_and_connective_tissue_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {gastrointestinal_disorders#not
    a &a }{#potential cause|potential reason!} for {gastrointestinal_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {social_circumstances#not
    a &a }{#potential cause|potential reason!} for {social_circumstances__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {immune_system_disorders#not
    a &a }{#potential cause|potential reason!} for {immune_system_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {reproductive_system_and_breast_disorders#not
    a &a }{#potential cause|potential reason!} for {reproductive_system_and_breast_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {neoplasms_benign_malignant_and_unspecified_(incl_cysts_and_polyps)#not
    a &a }{#potential cause|potential reason!} for {neoplasms_benign_malignant_and_unspecified_(incl_cysts_and_polyps)__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {general_disorders_and_administration_site_conditions#not
    a &a }{#potential cause|potential reason!} for {general_disorders_and_administration_site_conditions__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {endocrine_disorders#not
    a &a }{#potential cause|potential reason!} for {endocrine_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {surgical_and_medical_procedures#not
    a &a }{#potential cause|potential reason!} for {surgical_and_medical_procedures__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {vascular_disorders#not
    a &a }{#potential cause|potential reason!} for {vascular_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {blood_and_lymphatic_system_disorders#not
    a &a }{#potential cause|potential reason!} for {blood_and_lymphatic_system_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {skin_and_subcutaneous_tissue_disorders#not
    a &a }{#potential cause|potential reason!} for {skin_and_subcutaneous_tissue_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {congenital_familial_and_genetic_disorders#not
    a &a }{#potential cause|potential reason!} for {congenital_familial_and_genetic_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {infections_and_infestations#not
    a &a }{#potential cause|potential reason!} for {infections_and_infestations__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {respiratory_thoracic_and_mediastinal_disorders#not
    a &a }{#potential cause|potential reason!} for {respiratory_thoracic_and_mediastinal_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {psychiatric_disorders#not
    a &a }{#potential cause|potential reason!} for {psychiatric_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {renal_and_urinary_disorders#not
    a &a }{#potential cause|potential reason!} for {renal_and_urinary_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {pregnancy_puerperium_and_perinatal_conditions#not
    a &a }{#potential cause|potential reason!} for {pregnancy_puerperium_and_perinatal_conditions__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {ear_and_labyrinth_disorders#not
    a &a }{#potential cause|potential reason!} for {ear_and_labyrinth_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {cardiac_disorders#not
    a &a }{#potential cause|potential reason!} for {cardiac_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {nervous_system_disorders#not
    a &a }{#potential cause|potential reason!} for {nervous_system_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {injury_poisoning_and_procedural_complications#not
    a &a }{#potential cause|potential reason!} for {injury_poisoning_and_procedural_complications__names__noun}.
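Every `templates` entry in the YAML follows one pattern, with the target id substituted in two places. As a rough sketch for reviewers, the snippet below rebuilds a single entry from a column id using the same string pieces as the list comprehension in `write_meta`. The helper name `build_template` is hypothetical and not part of this PR.

```python
# Hypothetical helper (not in the PR): mirrors the string concatenation
# used for the `templates` list in write_meta.
def build_template(col_id: str) -> str:
    return (
        "The {#molecule|compound|chemical|molecular species|chemical compound!} "
        "with the {SMILES__description}{#representation of |!}{SMILES#} is {"
        + col_id
        + "#not a &a }{#potential cause|potential reason!} for {"
        + col_id
        + "__names__noun}."
    )


template = build_template("eye_disorders")
print(template)
```

The two-line entries in the YAML are the same strings after line folding, where the line break inside a plain scalar reads back as a single space.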
@@ -0,0 +1,115 @@
from typing import List, Tuple

import pandas as pd
import yaml


def load_dataset() -> pd.DataFrame:
    sider = pd.read_csv(
        "https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/sider.csv.gz"
    )
    return sider


def transform_data() -> Tuple[pd.DataFrame, pd.Index]:
    sider = load_dataset()
    old_columns = sider.columns.str.lower()
    sider.columns = sider.columns.str.lower().str.replace(" ", "_").str.replace(",", "")
    sider = sider.rename(columns={"smiles": "SMILES"})
    sider.to_csv("data_clean.csv", index=False)

    return sider, old_columns


def write_meta(column_ids: pd.Index, descriptions: List[str], num_points: int) -> None:
    # Write metadata
    targets = [
        {
            "id": f"{col_id}",
            "description": f"{description}",
            "type": "boolean",
            "names": [{"noun": f"{description}".lower()}],
        }
        for col_id, description in zip(column_ids[1:], descriptions[1:])
    ]

    templates = [
        "The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}"  # noqa: E501
        + "{#representation of |!}{SMILES#} is {"
        + col_id
        + "#not a &a }"
        + "{#potential cause|potential reason!} for {"
        + col_id
        + "__names__noun}."  # noqa: E501
        for col_id in column_ids[1:]
    ]

    meta = {
        "name": "SIDER",  # unique identifier, we will also use this for directory names
        "description": """Database of marketed drugs and adverse drug reactions (ADR), grouped into 27 system organ classes.""",  # noqa: E501
        "identifiers": [
            {
                "id": "SMILES",  # column name
                "type": "SMILES",
                "description": "SMILES",  # description (optional, except for "Other")
            }
        ],
        "targets": targets,
        "license": "CC BY 4.0",  # license under which the original dataset was published
        "links": [  # list of relevant links (original dataset, other uses, etc.)
            {
                "url": "https://academic.oup.com/nar/article/44/D1/D1075/2502602?login=false",
                "description": "corresponding publication",
            },
            {
                "url": "https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/sider.csv.gz",
                "description": "Data source",
            },
        ],
        "num_points": num_points,  # number of datapoints in this dataset
        "bibtex": [
            """@article{10.1093/nar/gkv1075,
author = {Kuhn, Michael and Letunic, Ivica and Jensen, Lars Juhl and Bork, Peer},
title = "{The SIDER database of drugs and side effects}",
journal = {Nucleic Acids Research},
volume = {44},
number = {D1},
pages = {D1075-D1079},
year = {2015},
month = {10},
issn = {0305-1048},
doi = {10.1093/nar/gkv1075},
url = {https://doi.org/10.1093/nar/gkv1075},
}""",
        ],
        "templates": templates,
    }

    def str_presenter(dumper, data):
        """configures yaml for dumping multiline strings
        Ref: https://stackoverflow.com/questions/8640959/how-can-i-control-what-scalar-form-pyyaml-uses-for-my-data
        """
        if data.count("\n") > 0:  # check for multiline string
            return dumper.represent_scalar("tag:yaml.org,2002:str", data, style="|")
        return dumper.represent_scalar("tag:yaml.org,2002:str", data)

    yaml.add_representer(str, str_presenter)
    yaml.representer.SafeRepresenter.add_representer(
        str, str_presenter
    )  # to use with safe_dump
    fn_meta = "meta.yaml"
    with open(fn_meta, "w") as f:
        yaml.dump(meta, f, sort_keys=False)

    print(f"Finished processing {meta['name']} dataset!")

    return


def main():
    sider, old_columns = transform_data()
    write_meta(sider.columns, old_columns, len(sider))


if __name__ == "__main__":
    main()
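To check how the target ids are derived from the CSV headers, the renaming done in `transform_data` can be reproduced with plain string methods, without downloading the dataset. This is only a sketch: `normalize_column` is a hypothetical helper, and the header strings are assumed from the descriptions in the metadata (the exact capitalization in the CSV is an assumption).

```python
# Hypothetical stand-in for the pandas chain in transform_data:
# .str.lower().str.replace(" ", "_").str.replace(",", "")
def normalize_column(name: str) -> str:
    """Lowercase the header, turn spaces into underscores, drop commas."""
    return name.lower().replace(" ", "_").replace(",", "")


# Assumed header spellings, based on the target descriptions in meta.yaml.
headers = [
    "Product issues",
    "Congenital, familial and genetic disorders",
    "Neoplasms benign, malignant and unspecified (incl cysts and polyps)",
]
ids = [normalize_column(h) for h in headers]
for header, col_id in zip(headers, ids):
    print(f"{header!r} -> {col_id!r}")
```

Only spaces and commas are handled, so the parentheses in the neoplasms header carry over into the id unchanged. Whether that is acceptable for an id is a question for the maintainers.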
what does this mean? what is an example of a product issue?
@AdrianM0 do you know what product issues is supposed to mean? Does this noun fit the column name?