add SIDER #438

Merged: 13 commits, Nov 1, 2023
220 changes: 220 additions & 0 deletions data/tabular/SIDER/meta.yaml
@@ -0,0 +1,220 @@
---
name: SIDER
description: Database of marketed drugs and adverse drug reactions (ADR), grouped into 27 system organ classes.
identifiers:
- id: SMILES
  type: SMILES
  description: SMILES
targets:
- id: hepatobiliary_disorders
  description: hepatobiliary disorders
  type: boolean
  names:
  - noun: hepatobiliary disorders
- id: metabolism_and_nutrition_disorders
  description: metabolism and nutrition disorders
  type: boolean
  names:
  - noun: metabolism and nutrition disorders
- id: product_issues
  description: product issues
Collaborator

what does this mean? what is an example of a product issue?

Collaborator

@AdrianM0 do you know what product issues is supposed to mean? Does this noun fit the column name?

  type: boolean
  names:
  - noun: product issues
- id: eye_disorders
  description: eye disorders
  type: boolean
  names:
  - noun: eye disorders
- id: investigations
  description: investigations
  type: boolean
  names:
  - noun: investigations
- id: musculoskeletal_and_connective_tissue_disorders
  description: musculoskeletal and connective tissue disorders
  type: boolean
  names:
  - noun: musculoskeletal and connective tissue disorders
- id: gastrointestinal_disorders
  description: gastrointestinal disorders
  type: boolean
  names:
  - noun: gastrointestinal disorders
- id: social_circumstances
Collaborator

@kjappelbaum, Oct 18, 2023

is this a disorder?

Collaborator

or is this some kind of disorder?

Contributor Author

@kjappelbaum I can also remove these two?

Collaborator

also an option if you cannot figure out what those are about
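
If the two questioned columns (`product_issues` and `social_circumstances`) were dropped, as floated in this thread, the ids and their descriptions would have to be filtered together so they stay aligned. A minimal sketch of that option, not something the PR does; the helper name `drop_unclear_targets` and the constant `UNCLEAR_TARGETS` are hypothetical:

```python
from typing import List, Sequence, Tuple

# Ids questioned in the review thread (hypothetical constant, not part of the PR).
UNCLEAR_TARGETS = frozenset({"product_issues", "social_circumstances"})


def drop_unclear_targets(
    column_ids: Sequence[str], descriptions: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Filter ids and descriptions in parallel so the two lists stay aligned."""
    kept = [
        (col_id, description)
        for col_id, description in zip(column_ids, descriptions)
        if col_id not in UNCLEAR_TARGETS
    ]
    ids = [col_id for col_id, _ in kept]
    texts = [description for _, description in kept]
    return ids, texts


# Shortened, illustrative column list; the real table has 27 target columns.
ids, texts = drop_unclear_targets(
    ["SMILES", "product_issues", "eye_disorders", "social_circumstances"],
    ["smiles", "product issues", "eye disorders", "social circumstances"],
)
print(ids)
print(texts)
```

The same two columns would also need to be removed from the DataFrame before `data_clean.csv` is written, and the "27 system organ classes" wording in the description would no longer match the remaining targets.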

  description: social circumstances
  type: boolean
  names:
  - noun: social circumstances
- id: immune_system_disorders
  description: immune system disorders
  type: boolean
  names:
  - noun: immune system disorders
- id: reproductive_system_and_breast_disorders
  description: reproductive system and breast disorders
  type: boolean
  names:
  - noun: reproductive system and breast disorders
- id: neoplasms_benign_malignant_and_unspecified_(incl_cysts_and_polyps)
  description: neoplasms benign, malignant and unspecified (incl cysts and polyps)
  type: boolean
  names:
  - noun: neoplasms benign, malignant and unspecified (incl cysts and polyps)
- id: general_disorders_and_administration_site_conditions
  description: general disorders and administration site conditions
  type: boolean
  names:
  - noun: general disorders and administration site conditions
- id: endocrine_disorders
  description: endocrine disorders
  type: boolean
  names:
  - noun: endocrine disorders
- id: surgical_and_medical_procedures
  description: surgical and medical procedures
  type: boolean
  names:
  - noun: surgical and medical procedures
- id: vascular_disorders
  description: vascular disorders
  type: boolean
  names:
  - noun: vascular disorders
- id: blood_and_lymphatic_system_disorders
  description: blood and lymphatic system disorders
  type: boolean
  names:
  - noun: blood and lymphatic system disorders
- id: skin_and_subcutaneous_tissue_disorders
  description: skin and subcutaneous tissue disorders
  type: boolean
  names:
  - noun: skin and subcutaneous tissue disorders
- id: congenital_familial_and_genetic_disorders
  description: congenital, familial and genetic disorders
  type: boolean
  names:
  - noun: congenital, familial and genetic disorders
- id: infections_and_infestations
  description: infections and infestations
  type: boolean
  names:
  - noun: infections and infestations
- id: respiratory_thoracic_and_mediastinal_disorders
  description: respiratory, thoracic and mediastinal disorders
  type: boolean
  names:
  - noun: respiratory, thoracic and mediastinal disorders
- id: psychiatric_disorders
  description: psychiatric disorders
  type: boolean
  names:
  - noun: psychiatric disorders
- id: renal_and_urinary_disorders
  description: renal and urinary disorders
  type: boolean
  names:
  - noun: renal and urinary disorders
- id: pregnancy_puerperium_and_perinatal_conditions
  description: pregnancy, puerperium and perinatal conditions
  type: boolean
  names:
  - noun: pregnancy, puerperium and perinatal conditions
- id: ear_and_labyrinth_disorders
  description: ear and labyrinth disorders
  type: boolean
  names:
  - noun: ear and labyrinth disorders
- id: cardiac_disorders
  description: cardiac disorders
  type: boolean
  names:
  - noun: cardiac disorders
- id: nervous_system_disorders
  description: nervous system disorders
  type: boolean
  names:
  - noun: nervous system disorders
- id: injury_poisoning_and_procedural_complications
  description: injury, poisoning and procedural complications
  type: boolean
  names:
  - noun: injury, poisoning and procedural complications
license: CC BY 4.0
links:
- url: https://academic.oup.com/nar/article/44/D1/D1075/2502602?login=false
  description: corresponding publication
- url: https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/sider.csv.gz
  description: Data source
num_points: 1427
bibtex:
- |-
  @article{10.1093/nar/gkv1075,
  author = {Kuhn, Michael and Letunic, Ivica and Jensen, Lars Juhl and Bork, Peer},
  title = "{The SIDER database of drugs and side effects}",
  journal = {Nucleic Acids Research},
  volume = {44},
  number = {D1},
  pages = {D1075-D1079},
  year = {2015},
  month = {10},
  issn = {0305-1048},
  doi = {10.1093/nar/gkv1075},
  url = {https://doi.org/10.1093/nar/gkv1075},
  }
templates:
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {hepatobiliary_disorders#not
  a &a }{#potential cause|potential reason!} for {hepatobiliary_disorders__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {metabolism_and_nutrition_disorders#not
  a &a }{#potential cause|potential reason!} for {metabolism_and_nutrition_disorders__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {product_issues#not
  a &a }{#potential cause|potential reason!} for {product_issues__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {eye_disorders#not
  a &a }{#potential cause|potential reason!} for {eye_disorders__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {investigations#not
  a &a }{#potential cause|potential reason!} for {investigations__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {musculoskeletal_and_connective_tissue_disorders#not
  a &a }{#potential cause|potential reason!} for {musculoskeletal_and_connective_tissue_disorders__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {gastrointestinal_disorders#not
  a &a }{#potential cause|potential reason!} for {gastrointestinal_disorders__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {social_circumstances#not
  a &a }{#potential cause|potential reason!} for {social_circumstances__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {immune_system_disorders#not
  a &a }{#potential cause|potential reason!} for {immune_system_disorders__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {reproductive_system_and_breast_disorders#not
  a &a }{#potential cause|potential reason!} for {reproductive_system_and_breast_disorders__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {neoplasms_benign_malignant_and_unspecified_(incl_cysts_and_polyps)#not
  a &a }{#potential cause|potential reason!} for {neoplasms_benign_malignant_and_unspecified_(incl_cysts_and_polyps)__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {general_disorders_and_administration_site_conditions#not
  a &a }{#potential cause|potential reason!} for {general_disorders_and_administration_site_conditions__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {endocrine_disorders#not
  a &a }{#potential cause|potential reason!} for {endocrine_disorders__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {surgical_and_medical_procedures#not
  a &a }{#potential cause|potential reason!} for {surgical_and_medical_procedures__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {vascular_disorders#not
  a &a }{#potential cause|potential reason!} for {vascular_disorders__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {blood_and_lymphatic_system_disorders#not
  a &a }{#potential cause|potential reason!} for {blood_and_lymphatic_system_disorders__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {skin_and_subcutaneous_tissue_disorders#not
  a &a }{#potential cause|potential reason!} for {skin_and_subcutaneous_tissue_disorders__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {congenital_familial_and_genetic_disorders#not
  a &a }{#potential cause|potential reason!} for {congenital_familial_and_genetic_disorders__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {infections_and_infestations#not
  a &a }{#potential cause|potential reason!} for {infections_and_infestations__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {respiratory_thoracic_and_mediastinal_disorders#not
  a &a }{#potential cause|potential reason!} for {respiratory_thoracic_and_mediastinal_disorders__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {psychiatric_disorders#not
  a &a }{#potential cause|potential reason!} for {psychiatric_disorders__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {renal_and_urinary_disorders#not
  a &a }{#potential cause|potential reason!} for {renal_and_urinary_disorders__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {pregnancy_puerperium_and_perinatal_conditions#not
  a &a }{#potential cause|potential reason!} for {pregnancy_puerperium_and_perinatal_conditions__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {ear_and_labyrinth_disorders#not
  a &a }{#potential cause|potential reason!} for {ear_and_labyrinth_disorders__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {cardiac_disorders#not
  a &a }{#potential cause|potential reason!} for {cardiac_disorders__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {nervous_system_disorders#not
  a &a }{#potential cause|potential reason!} for {nervous_system_disorders__names__noun}.
- The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}{#representation of |!}{SMILES#} is {injury_poisoning_and_procedural_complications#not
  a &a }{#potential cause|potential reason!} for {injury_poisoning_and_procedural_complications__names__noun}.
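
To make the template syntax above easier to read, here is a rough sketch of how one template might be filled for one row. It is not the project's sampler, and the placeholder semantics are assumptions: `{#a|b!}` picks one alternative at random, `{SMILES#}` inserts the column value, `{<id>#neg&pos}` picks the first text for 0 and the second for 1, and `{..__..}` looks up a metadata field. The function `render`, the flat `fields` dict, and the sample row are all hypothetical.

```python
import random
import re

PLACEHOLDER = re.compile(r"\{([^{}]*)\}")


def render(template: str, row: dict, fields: dict, rng: random.Random) -> str:
    """Fill one template for one row (simplified, assumed semantics)."""

    def substitute(match: re.Match) -> str:
        token = match.group(1)
        if token.startswith("#") and token.endswith("!"):
            # {#a|b|c!}: choose one alternative at random (it may be empty).
            return rng.choice(token[1:-1].split("|"))
        if token in fields:
            # {SMILES__description}, {<id>__names__noun}: metadata lookup.
            return fields[token]
        column, _, options = token.partition("#")
        if not options:
            # {SMILES#}: the raw value of the column.
            return str(row[column])
        # {<id>#neg&pos}: first text for 0/False, second for 1/True.
        negative, positive = options.split("&")
        return positive if row[column] else negative

    return PLACEHOLDER.sub(substitute, template)


template = (
    "The {#molecule|compound|chemical|molecular species|chemical compound!} with the "
    "{SMILES__description}{#representation of |!}{SMILES#} is "
    "{hepatobiliary_disorders#not a &a }{#potential cause|potential reason!} "
    "for {hepatobiliary_disorders__names__noun}."
)
fields = {
    "SMILES__description": "SMILES",
    "hepatobiliary_disorders__names__noun": "hepatobiliary disorders",
}
# Made-up row for illustration only; it is not taken from SIDER.
row = {"SMILES": "CCO", "hepatobiliary_disorders": 0}
sentence = render(template, row, fields, random.Random(0))
print(sentence)
```

Under these assumed semantics, `{SMILES__description}` is joined directly to the next piece, so the sketch prints `SMILESrepresentation of CCO` or `SMILESCCO` with no space in between. Whether the real sampler behaves the same way is not confirmed here.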
115 changes: 115 additions & 0 deletions data/tabular/SIDER/transform.py
@@ -0,0 +1,115 @@
from typing import List, Tuple

import pandas as pd
import yaml


def load_dataset() -> pd.DataFrame:
    """Download the SIDER table from the DeepChem S3 bucket."""
    sider = pd.read_csv(
        "https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/sider.csv.gz"
    )
    return sider


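def transform_data() -> Tuple[pd.DataFrame, pd.Index]:
    """Clean the column names, write data_clean.csv, and return the frame and the original headers."""
    sider = load_dataset()
    # Keep the lower-cased original headers; they are used as target descriptions.
    old_columns = sider.columns.str.lower()
    # Column ids: lower-case, spaces replaced by underscores, commas removed.
    sider.columns = sider.columns.str.lower().str.replace(" ", "_").str.replace(",", "")
    sider = sider.rename(columns={"smiles": "SMILES"})
    sider.to_csv("data_clean.csv", index=False)

    return sider, old_columns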


def write_meta(column_ids: pd.Index, descriptions: List[str], num_points: int) -> None:
    """Build the metadata for all target columns and write it to meta.yaml."""
    # Write metadata; the first column (SMILES) is the identifier, not a target.
    targets = [
        {
            "id": f"{col_id}",
            "description": f"{description}",
            "type": "boolean",
            "names": [{"noun": f"{description}".lower()}],
        }
        for col_id, description in zip(column_ids[1:], descriptions[1:])
    ]

    # One template per target column.
    templates = [
        "The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}"  # noqa: E501
        + "{#representation of |!}{SMILES#} is {"
        + col_id
        + "#not a &a }"
        + "{#potential cause|potential reason!} for {"
        + col_id
        + "__names__noun}."  # noqa: E501
        for col_id in column_ids[1:]
    ]

    meta = {
        "name": "SIDER",  # unique identifier, we will also use this for directory names
        "description": """Database of marketed drugs and adverse drug reactions (ADR), grouped into 27 system organ classes.""",  # noqa: E501
        "identifiers": [
            {
                "id": "SMILES",  # column name
                "type": "SMILES",
                "description": "SMILES",  # description (optional, except for "Other")
            }
        ],
        "targets": targets,
        "license": "CC BY 4.0",  # license under which the original dataset was published
        "links": [  # list of relevant links (original dataset, other uses, etc.)
            {
                "url": "https://academic.oup.com/nar/article/44/D1/D1075/2502602?login=false",
                "description": "corresponding publication",
            },
            {
                "url": "https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/sider.csv.gz",
                "description": "Data source",
            },
        ],
        "num_points": num_points,  # number of datapoints in this dataset
        "bibtex": [
            """@article{10.1093/nar/gkv1075,
author = {Kuhn, Michael and Letunic, Ivica and Jensen, Lars Juhl and Bork, Peer},
title = "{The SIDER database of drugs and side effects}",
journal = {Nucleic Acids Research},
volume = {44},
number = {D1},
pages = {D1075-D1079},
year = {2015},
month = {10},
issn = {0305-1048},
doi = {10.1093/nar/gkv1075},
url = {https://doi.org/10.1093/nar/gkv1075},
}""",
        ],
        "templates": templates,
    }

    def str_presenter(dumper, data):
        """configures yaml for dumping multiline strings
        Ref: https://stackoverflow.com/questions/8640959/how-can-i-control-what-scalar-form-pyyaml-uses-for-my-data
        """
        if data.count("\n") > 0:  # check for multiline string
            return dumper.represent_scalar("tag:yaml.org,2002:str", data, style="|")
        return dumper.represent_scalar("tag:yaml.org,2002:str", data)

    yaml.add_representer(str, str_presenter)
    yaml.representer.SafeRepresenter.add_representer(
        str, str_presenter
    )  # to use with safe_dump
    fn_meta = "meta.yaml"
    with open(fn_meta, "w") as f:
        yaml.dump(meta, f, sort_keys=False)

    print(f"Finished processing {meta['name']} dataset!")


def main():
    sider, old_columns = transform_data()
    write_meta(sider.columns, old_columns, len(sider))


if __name__ == "__main__":
    main()
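
The id/description pairs in `meta.yaml` follow from the column clean-up in `transform_data`. A small plain-Python sketch of the same three string steps, so the mapping can be checked without downloading the data. The helper names `clean_column` and `id_description_pairs` are hypothetical, and the sample headers are assumed to match the capitalisation used in `sider.csv`:

```python
from typing import List, Tuple


def clean_column(header: str) -> str:
    """Same steps as the pandas chain: lower-case, spaces to underscores, drop commas."""
    return header.lower().replace(" ", "_").replace(",", "")


def id_description_pairs(headers: List[str]) -> List[Tuple[str, str]]:
    """Pair each target id with its lower-cased header, skipping the first (SMILES) column."""
    return [(clean_column(header), header.lower()) for header in headers[1:]]


# Assumed sample of headers; only a few of the 27 target columns are listed.
headers = [
    "smiles",
    "Hepatobiliary disorders",
    "Congenital, familial and genetic disorders",
    "Neoplasms benign, malignant and unspecified (incl cysts and polyps)",
]
for col_id, description in id_description_pairs(headers):
    print(f"{col_id!r} <- {description!r}")
```

Only spaces and commas are rewritten, so the parentheses survive. That is why the id `neoplasms_benign_malignant_and_unspecified_(incl_cysts_and_polyps)` appears in the metadata and inside the template placeholders.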