add SIDER #438

Merged
merged 13 commits into from
Nov 1, 2023
215 changes: 215 additions & 0 deletions data/tabular/SIDER/meta.yaml
@@ -0,0 +1,215 @@
---
name: SIDER
description: Database of marketed drugs and adverse drug reactions (ADR), grouped into 23 system organ classes.
identifiers:
  - id: SMILES
    type: SMILES
    description: SMILES
targets:
  - id: hepatobiliary_disorders
    description: hepatobiliary disorders
    type: boolean
    names:
      - noun: hepatobiliary disorders
      - noun: liver and gallbladder disorders
  - id: metabolism_and_nutrition_disorders
    description: metabolism and nutrition disorders
    type: boolean
    names:
      - noun: metabolism and nutrition disorders
      - noun: metabolic and nutritional disorders
  - id: eye_disorders
    description: eye disorders
    type: boolean
    names:
      - noun: eye disorders
      - noun: ophthalmic disorders
  - id: musculoskeletal_and_connective_tissue_disorders
    description: musculoskeletal and connective tissue disorders
    type: boolean
    names:
      - noun: musculoskeletal and connective tissue disorders
      - noun: muscle and joint disorders
  - id: gastrointestinal_disorders
    description: gastrointestinal disorders
    type: boolean
    names:
      - noun: gastrointestinal disorders
      - noun: digestive system disorders
  - id: immune_system_disorders
    description: immune system disorders
    type: boolean
    names:
      - noun: immune system disorders
      - noun: disorders of the immune system
  - id: reproductive_system_and_breast_disorders
    description: reproductive system and breast disorders
    type: boolean
    names:
      - noun: reproductive system and breast disorders
      - noun: disorders of the breasts and the reproductive system
  - id: neoplasms_benign_malignant_and_unspecified_(incl_cysts_and_polyps)
    description: neoplasms benign, malignant and unspecified (incl cysts and polyps)
    type: boolean
    names:
      - noun: neoplasms benign, malignant and unspecified (incl cysts and polyps)
      - noun: benign and malignant tumors (including cysts and polyps)
  - id: general_disorders_and_administration_site_conditions
    description: general disorders and administration site conditions
    type: boolean
    names:
      - noun: general disorders and administration site conditions
      - noun: general health and administration site conditions
  - id: endocrine_disorders
    description: endocrine disorders
    type: boolean
    names:
      - noun: endocrine disorders
      - noun: endocrine system disorders
  - id: surgical_and_medical_procedures
    description: surgical and medical procedures
    type: boolean
    names:
      - noun: surgical and medical procedures
      - noun: medical and surgical procedures
  - id: vascular_disorders
    description: vascular disorders
    type: boolean
    names:
      - noun: vascular disorders
      - noun: vascular system disorders
  - id: blood_and_lymphatic_system_disorders
    description: blood and lymphatic system disorders
    type: boolean
    names:
      - noun: blood and lymphatic system disorders
      - noun: disorders of the blood and lymphatic system
  - id: skin_and_subcutaneous_tissue_disorders
    description: skin and subcutaneous tissue disorders
    type: boolean
    names:
      - noun: skin and subcutaneous tissue disorders
      - noun: disorders of the skin and subcutaneous tissue
  - id: congenital_familial_and_genetic_disorders
    description: congenital, familial and genetic disorders
    type: boolean
    names:
      - noun: congenital, familial and genetic disorders
      - noun: familial, congenital and genetic disorders
  - id: infections_and_infestations
    description: infections and infestations
    type: boolean
    names:
      - noun: infections and infestations
      - noun: infestations and infections
  - id: respiratory_thoracic_and_mediastinal_disorders
    description: respiratory, thoracic and mediastinal disorders
    type: boolean
    names:
      - noun: respiratory, thoracic and mediastinal disorders
      - noun: respiratory and thoracic disorders
  - id: psychiatric_disorders
    description: psychiatric disorders
    type: boolean
    names:
      - noun: psychiatric disorders
      - noun: mental health and psychiatric disorders
  - id: renal_and_urinary_disorders
    description: renal and urinary disorders
    type: boolean
    names:
      - noun: renal and urinary disorders
      - noun: kidney and urinary tract disorders
  - id: pregnancy_puerperium_and_perinatal_conditions
    description: pregnancy, puerperium and perinatal conditions
    type: boolean
    names:
      - noun: pregnancy, puerperium and perinatal conditions
      - noun: pregnancy, childbirth, and newborn conditions
  - id: ear_and_labyrinth_disorders
    description: ear and labyrinth disorders
    type: boolean
    names:
      - noun: ear and labyrinth disorders
      - noun: ear and inner ear disorders
  - id: cardiac_disorders
    description: cardiac disorders
    type: boolean
    names:
      - noun: cardiac disorders
      - noun: cardiovascular disorders
  - id: nervous_system_disorders
    description: nervous system disorders
    type: boolean
    names:
      - noun: nervous system disorders
      - noun: disorders of the nervous system
license: CC BY 4.0
links:
  - url: https://academic.oup.com/nar/article/44/D1/D1075/2502602?login=false
    description: corresponding publication
  - url: https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/sider.csv.gz
    description: Data source
num_points: 1427
bibtex:
  - |-
    @article{10.1093/nar/gkv1075,
    author = {Kuhn, Michael and Letunic, Ivica and Jensen, Lars Juhl and Bork, Peer},
    title = "{The SIDER database of drugs and side effects}",
    journal = {Nucleic Acids Research},
    volume = {44},
    number = {D1},
    pages = {D1075-D1079},
    year = {2015},
    month = {10},
    issn = {0305-1048},
    doi = {10.1093/nar/gkv1075},
    url = {https://doi.org/10.1093/nar/gkv1075},
    }
templates:
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {hepatobiliary_disorders#not
    a &a }{#potential cause|potential reason!} for {hepatobiliary_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {metabolism_and_nutrition_disorders#not
    a &a }{#potential cause|potential reason!} for {metabolism_and_nutrition_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {eye_disorders#not
    a &a }{#potential cause|potential reason!} for {eye_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {musculoskeletal_and_connective_tissue_disorders#not
    a &a }{#potential cause|potential reason!} for {musculoskeletal_and_connective_tissue_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {gastrointestinal_disorders#not
    a &a }{#potential cause|potential reason!} for {gastrointestinal_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {immune_system_disorders#not
    a &a }{#potential cause|potential reason!} for {immune_system_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {reproductive_system_and_breast_disorders#not
    a &a }{#potential cause|potential reason!} for {reproductive_system_and_breast_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {neoplasms_benign_malignant_and_unspecified_(incl_cysts_and_polyps)#not
    a &a }{#potential cause|potential reason!} for {neoplasms_benign_malignant_and_unspecified_(incl_cysts_and_polyps)__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {general_disorders_and_administration_site_conditions#not
    a &a }{#potential cause|potential reason!} for {general_disorders_and_administration_site_conditions__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {endocrine_disorders#not
    a &a }{#potential cause|potential reason!} for {endocrine_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {surgical_and_medical_procedures#not
    a &a }{#potential cause|potential reason!} for {surgical_and_medical_procedures__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {vascular_disorders#not
    a &a }{#potential cause|potential reason!} for {vascular_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {blood_and_lymphatic_system_disorders#not
    a &a }{#potential cause|potential reason!} for {blood_and_lymphatic_system_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {skin_and_subcutaneous_tissue_disorders#not
    a &a }{#potential cause|potential reason!} for {skin_and_subcutaneous_tissue_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {congenital_familial_and_genetic_disorders#not
    a &a }{#potential cause|potential reason!} for {congenital_familial_and_genetic_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {infections_and_infestations#not
    a &a }{#potential cause|potential reason!} for {infections_and_infestations__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {respiratory_thoracic_and_mediastinal_disorders#not
    a &a }{#potential cause|potential reason!} for {respiratory_thoracic_and_mediastinal_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {psychiatric_disorders#not
    a &a }{#potential cause|potential reason!} for {psychiatric_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {renal_and_urinary_disorders#not
    a &a }{#potential cause|potential reason!} for {renal_and_urinary_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {pregnancy_puerperium_and_perinatal_conditions#not
    a &a }{#potential cause|potential reason!} for {pregnancy_puerperium_and_perinatal_conditions__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {ear_and_labyrinth_disorders#not
    a &a }{#potential cause|potential reason!} for {ear_and_labyrinth_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {cardiac_disorders#not
    a &a }{#potential cause|potential reason!} for {cardiac_disorders__names__noun}.
  - The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description} {#representation of |!}{SMILES#} is {nervous_system_disorders#not
    a &a }{#potential cause|potential reason!} for {nervous_system_disorders__names__noun}.
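
The templates in `meta.yaml` use a small placeholder syntax, and this PR does not include the code that expands it. The sketch below is a rough, hypothetical reading of that syntax, not the project's actual template engine: `{#a|b!}` picks one alternative at random, `{col#}` inserts the column value, `{col__description}` and `{col__names__noun}` look up metadata, and `{col#x&y}` is assumed to use `x` when the boolean is false and `y` when it is true (the wording "not a" / "a" suggests this order, but it is an assumption). The function name `render_template`, the `meta` layout, and the example row are made up for illustration.

```python
import random
import re


def render_template(template, row, meta, rng):
    """Fill one template for one data row (simplified, assumed semantics)."""

    def pick_alternative(match):
        # {#a|b|c!}: choose one of the alternatives at random
        return rng.choice(match.group(1).split("|"))

    def pick_boolean_text(match):
        # {col#false text&true text}: ASSUMED order, false first, true second
        column, if_false, if_true = match.groups()
        return if_true if row[column] else if_false

    def insert_value(match):
        # {col#}: the raw value stored in that column
        return str(row[match.group(1)])

    def insert_metadata(match):
        # {col__description} or {col__names__noun}
        column, field = match.groups()
        if field == "description":
            return meta[column]["description"]
        return rng.choice(meta[column]["nouns"])

    text = re.sub(r"\{#([^{}]*)!\}", pick_alternative, template)
    text = re.sub(r"\{([^{}#]+)#([^{}&]*)&([^{}]*)\}", pick_boolean_text, text)
    text = re.sub(r"\{([^{}#]+)#\}", insert_value, text)
    text = re.sub(r"\{([^{}#]+?)__(description|names__noun)\}", insert_metadata, text)
    return text


template = (
    "The {#molecule|compound!} with the {SMILES__description} "
    "{#representation of |!}{SMILES#} is {hepatobiliary_disorders#not a &a }"
    "{#potential cause|potential reason!} for {hepatobiliary_disorders__names__noun}."
)
# Hypothetical metadata layout, loosely mirroring the identifiers/targets above.
meta = {
    "SMILES": {"description": "SMILES"},
    "hepatobiliary_disorders": {
        "description": "hepatobiliary disorders",
        "nouns": ["hepatobiliary disorders", "liver and gallbladder disorders"],
    },
}
# Made-up row for illustration only; the label is not taken from the dataset.
row = {"SMILES": "CCO", "hepatobiliary_disorders": 0}
print(render_template(template, row, meta, random.Random(0)))
```

Passing the random generator in explicitly keeps the sampling reproducible, which helps when comparing rendered text across runs.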
153 changes: 153 additions & 0 deletions data/tabular/SIDER/transform.py
@@ -0,0 +1,153 @@
from typing import List, Tuple

import pandas as pd
import yaml

# Alternative names for the targets, listed in the same order as the target
# columns that remain in sider.csv after load_dataset() drops four columns.
ALT_DESCRIPTIONS = [
    "Liver and Gallbladder Disorders",
    "Metabolic and Nutritional Disorders",
    "Ophthalmic Disorders",
    "Muscle and Joint Disorders",
    "Digestive System Disorders",
    "Disorders of the Immune System",
    "Disorders of the breasts and the Reproductive system",
    "Benign and Malignant Tumors (including Cysts and Polyps)",
    "General Health and Administration Site Conditions",
    "Endocrine System Disorders",
    "Medical and Surgical Procedures",
    "Vascular System Disorders",
    "Disorders of the blood and lymphatic system",
    "Disorders of the Skin and Subcutaneous Tissue",
    "Familial, Congenital and Genetic Disorders",
    "Infestations and Infections",
    "Respiratory and Thoracic Disorders",
    "Mental Health and Psychiatric Disorders",
    "Kidney and Urinary Tract Disorders",
    "Pregnancy, Childbirth, and Newborn Conditions",
    "Ear and Inner Ear Disorders",
    "Cardiovascular Disorders",
    "Disorders of the Nervous System",
]


def load_dataset() -> pd.DataFrame:
    """Download the DeepChem copy of SIDER and drop four columns not kept as targets."""
    sider = pd.read_csv(
        "https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/sider.csv.gz"
    )

    sider = sider.drop(
        columns=[
            "Product issues",
            "Social circumstances",
            "Investigations",
            "Injury, poisoning and procedural complications",
        ]
    )
    return sider


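def transform_data() -> Tuple[pd.DataFrame, pd.Index]:
    """Normalize the column names, write data_clean.csv, and return the table.

    Also returns the lower-cased original column names, which are reused as
    human-readable target descriptions.
    """
    sider = load_dataset()
    old_columns = sider.columns.str.lower()
    # Column ids: lower case, spaces replaced by underscores, commas removed.
    sider.columns = sider.columns.str.lower().str.replace(" ", "_").str.replace(",", "")
    sider = sider.rename(columns={"smiles": "SMILES"})
    sider.to_csv("data_clean.csv", index=False)

    return sider, old_columns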


def write_meta(column_ids: pd.Index, descriptions: List[str], num_points: int) -> None:
    """Write meta.yaml for the cleaned SIDER table.

    The first entry of column_ids and descriptions is the SMILES column and is
    skipped; the remaining entries must line up with ALT_DESCRIPTIONS.
    """
    targets = [
        {
            "id": col_id,
            "description": description,
            "type": "boolean",
            "names": [
                {"noun": description.lower()},
                {"noun": alt_desc.lower()},
            ],
        }
        for col_id, description, alt_desc in zip(
            column_ids[1:], descriptions[1:], ALT_DESCRIPTIONS
        )
    ]

    # One text template per target column.
    templates = [
        "The {#molecule|compound|chemical|molecular species|chemical compound!} with the {SMILES__description}"  # noqa: E501
        + " {#representation of |!}{SMILES#} is {"
        + col_id
        + "#not a &a }"
        + "{#potential cause|potential reason!} for {"
        + col_id
        + "__names__noun}."  # noqa: E501
        for col_id in column_ids[1:]
    ]

    meta = {
        "name": "SIDER",  # unique identifier, we will also use this for directory names
        "description": f"""Database of marketed drugs and adverse drug reactions (ADR), grouped into {len(column_ids[1:])} system organ classes.""",  # noqa: E501
        "identifiers": [
            {
                "id": "SMILES",  # column name
                "type": "SMILES",
                "description": "SMILES",  # description (optional, except for "Other")
            }
        ],
        "targets": targets,
        "license": "CC BY 4.0",  # license under which the original dataset was published
        "links": [  # list of relevant links (original dataset, other uses, etc.)
            {
                "url": "https://academic.oup.com/nar/article/44/D1/D1075/2502602?login=false",
                "description": "corresponding publication",
            },
            {
                "url": "https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/sider.csv.gz",
                "description": "Data source",
            },
        ],
        "num_points": num_points,  # number of datapoints in this dataset
        "bibtex": [
            """@article{10.1093/nar/gkv1075,
author = {Kuhn, Michael and Letunic, Ivica and Jensen, Lars Juhl and Bork, Peer},
title = "{The SIDER database of drugs and side effects}",
journal = {Nucleic Acids Research},
volume = {44},
number = {D1},
pages = {D1075-D1079},
year = {2015},
month = {10},
issn = {0305-1048},
doi = {10.1093/nar/gkv1075},
url = {https://doi.org/10.1093/nar/gkv1075},
}""",
        ],
        "templates": templates,
    }

    def str_presenter(dumper, data):
        """Configure YAML to dump multiline strings in literal block style.

        Ref: https://stackoverflow.com/questions/8640959/how-can-i-control-what-scalar-form-pyyaml-uses-for-my-data
        """
        if data.count("\n") > 0:  # check for multiline string
            return dumper.represent_scalar("tag:yaml.org,2002:str", data, style="|")
        return dumper.represent_scalar("tag:yaml.org,2002:str", data)

    yaml.add_representer(str, str_presenter)
    yaml.representer.SafeRepresenter.add_representer(
        str, str_presenter
    )  # to use with safe_dump
    fn_meta = "meta.yaml"
    with open(fn_meta, "w") as f:
        yaml.dump(meta, f, sort_keys=False)

    print(f"Finished processing {meta['name']} dataset!")


def main():
    sider, old_columns = transform_data()
    write_meta(sider.columns, old_columns, len(sider))


if __name__ == "__main__":
    main()
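
Two details of `transform.py` are easy to miss: the column-id normalization keeps parentheses (hence ids such as `neoplasms_benign_malignant_and_unspecified_(incl_cysts_and_polyps)`), and `zip` silently stops at the shortest input, so a mismatch between the target columns and `ALT_DESCRIPTIONS` would drop targets without an error. The following is a minimal plain-Python sketch of both points, not part of the PR. The helper `to_column_id` and the shortened lists are hypothetical; the string operations are meant to mirror the pandas chain `.str.lower().str.replace(" ", "_").str.replace(",", "")` for these literal patterns.

```python
def to_column_id(name: str) -> str:
    """Plain-Python counterpart of the pandas chain in transform_data()."""
    return name.lower().replace(" ", "_").replace(",", "")


# Shortened, illustrative stand-ins for the CSV header and ALT_DESCRIPTIONS.
raw_columns = [
    "smiles",
    "Hepatobiliary disorders",
    "Neoplasms benign, malignant and unspecified (incl cysts and polyps)",
]
alt_descriptions = [
    "Liver and Gallbladder Disorders",
    "Benign and Malignant Tumors (including Cysts and Polyps)",
]

column_ids = [to_column_id(name) for name in raw_columns]
target_ids = column_ids[1:]  # the first column is the SMILES identifier

# Explicit guard: zip() would otherwise truncate to the shorter list.
if len(target_ids) != len(alt_descriptions):
    raise ValueError(
        f"{len(target_ids)} target columns but {len(alt_descriptions)} alternative names"
    )

for col_id, alt_desc in zip(target_ids, alt_descriptions):
    print(f"{col_id} -> {alt_desc.lower()}")
```

A length check like this could sit at the top of `write_meta`; on Python 3.10 or newer, `zip(..., strict=True)` would raise on a mismatch as well.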