Chemicals, diseases, and their relations play central roles in many areas of biomedical research and healthcare, such as drug discovery and safety surveillance. Although the ultimate goal in drug discovery is to develop chemicals for therapeutics, recognition of adverse drug reactions between chemicals and diseases is important for improving chemical safety and toxicity studies and for facilitating new screening assays for pharmaceutical compound survival. In addition, the identification of chemicals as biomarkers can help inform potential relationships between chemicals and pathologies.
Token Classification and Relation Detection for Bio Articles: work with token classification on biological articles, then use the resulting model in another task — relation extraction between named entities. The entity types are `Chemical` and `Disease`.
The first step was data pre-processing and extraction of the features needed for the task. The Python scripts `parser.py`, `to_iob_converter.py`, and `cid_data_extractor.py` were written for that purpose. The data can be found in the corresponding folder of the repository.
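The conversion to IOB tags can be sketched as follows. This is a minimal illustration of the tagging scheme, not the actual `to_iob_converter.py` implementation; the example sentence and spans are made up:

```python
def to_iob(tokens, entities):
    """Tag tokens with B-/I-/O labels.

    entities: list of (start, end, label) token spans, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags

tokens = ["Naloxone", "reverses", "the", "hypotensive", "effect"]
print(to_iob(tokens, [(0, 1, "Chemical"), (3, 4, "Disease")]))
# → ['B-Chemical', 'O', 'O', 'B-Disease', 'O']
```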
Results of finetuning BERT, SciBERT, and BioBERT:
The precision, recall, and F1 scores shown in the table below are the macro average (arithmetic mean) of those metrics over 5 classes: B-Chemical, I-Chemical, B-Disease, I-Disease, and O. A scikit-learn wrapper was used for the finetuning task. The code for this part can be found in the `ModelsForNERComparison.ipynb` notebook.
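To make the table's metrics concrete, here is a pure-Python sketch of how a macro average is computed: per-class precision, recall, and F1, then their unweighted mean. The toy tag sequences below are illustrative, not from the dataset:

```python
from collections import defaultdict

def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over all observed classes."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    classes = sorted(set(y_true) | set(y_pred))
    prec, rec, f1 = [], [], []
    for c in classes:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        prec.append(p)
        rec.append(r)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)
    n = len(classes)
    return sum(prec) / n, sum(rec) / n, sum(f1) / n

y_true = ["B-Chemical", "O", "B-Disease", "I-Disease", "O", "B-Chemical"]
y_pred = ["B-Chemical", "O", "B-Disease", "O", "O", "B-Chemical"]
print(macro_scores(y_true, y_pred))
```

Note that the macro average weights every class equally, so rare classes (here, I-Disease) pull the score down as much as frequent ones — which is why macro F1 in the table sits well below token accuracy.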
| Model | Precision | Recall | F1 Score | Accuracy | Model Description |
|---|---|---|---|---|---|
| BERT `bert-base-cased` | 0.83 | 0.70 | 0.76 | 0.95 | HuggingFace, GitHub, Paper |
| SciBERT `scibert-scivocab-cased` | 0.86 | 0.77 | 0.81 | 0.96 | HuggingFace, GitHub, Paper |
| BioBERT `biobert-v1.1-pubmed-base-cased` | 0.86 | 0.72 | 0.78 | 0.95 | GitHub, Paper |
SciBERT showed the best performance on the given data, so it was chosen for further improvements and visualization of results, which can be found in the `Scibert_TokenClassification.ipynb` notebook. The SciBERT model was also finetuned with spaCy pipelines in the `Finetuning_SciBERT_with_SpaCy_Pipeline.ipynb` notebook for more convenient further usage.
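Entity visualization uses `displacy`. A minimal sketch of rendering NER output with displacy's manual mode, which takes precomputed spans instead of a loaded model — the text and character offsets below are illustrative, not taken from the notebook:

```python
from spacy import displacy

# Precomputed entity spans (character offsets into the text).
doc = {
    "text": "Naloxone reverses the hypotensive effect of clonidine.",
    "ents": [
        {"start": 0, "end": 8, "label": "Chemical"},
        {"start": 22, "end": 33, "label": "Disease"},
        {"start": 44, "end": 53, "label": "Chemical"},
    ],
    "title": None,
}

# manual=True renders the dict directly, without running a spaCy pipeline.
html = displacy.render(doc, style="ent", manual=True)
```

In a notebook, passing `jupyter=True` displays the highlighted entities inline instead of returning the HTML string.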
The final approach with Knowledge Graphs can be found in the `RD_KG_solution.ipynb` notebook. Experiments with Knowledge Graphs on the given dataset can be found in the `KnowledgeGraphs.ipynb` notebook. The core idea was to analyze dependencies between words in sentences, extract subjects, objects, and relations, and then use the trained NER model to filter Diseases and Chemicals from them. The resulting .tsv file containing the relations can be found by the link. Overall, this approach has some issues, such as the small number of entity1-relation-entity2 triples left after filtering. Here is the visualization of the resulting Knowledge Graph.
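Once the triples survive the NER filter, assembling the graph itself is straightforward. A minimal sketch using a plain adjacency dict — the triples below are illustrative examples, not rows from the actual .tsv file:

```python
# (entity1, relation, entity2) triples after filtering for Chemical/Disease
# entities; these examples are made up for illustration.
triples = [
    ("clonidine", "induces", "hypotension"),
    ("naloxone", "reverses", "hypotension"),
    ("lidocaine", "causes", "seizures"),
]

# Directed adjacency: head entity -> list of (relation, tail entity) edges.
graph = {}
for head, relation, tail in triples:
    graph.setdefault(head, []).append((relation, tail))

for head, edges in graph.items():
    for relation, tail in edges:
        print(f"{head} --{relation}--> {tail}")
```

The notebooks visualize the graph rather than print it; with few triples left after filtering, the resulting graph is sparse, which is the main limitation noted above.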
- `parser.py` - parsing .txt files, feature engineering, converting to .csv
- `to_iob_converter.py` - converter to the IOB (inside-outside-beginning) format, a common tagging format for tokens
- `cid_data_extractor.py` - extracting related named-entity pairs from the DNER and CID parts of the datasets
- `ModelsForNERComparison.ipynb` (nbviewer) - finetuning `bert-base-cased`, `scibert-scivocab-cased`, and `biobert-v1.1-pubmed-base-cased` on the dataset in IOB format and comparing the results
- `Scibert_TokenClassification.ipynb` (nbviewer) - further work with `scibert-scivocab-cased`, as it showed the best performance among the models; developing functions for extracting entities from a user's text and visualizing the results with `displacy`
- `Finetuning_SciBERT_with_SpaCy_Pipeline.ipynb` - using the spaCy 3 library to finetune SciBERT for the NER task with a spaCy pipeline
- `KnowledgeGraphs.ipynb` (nbviewer) - trying out relation extraction methods without the use of NER entities; developing functions for building Knowledge Graphs and visualizing the results
- `RD_KG_solution.ipynb` - final approach with Knowledge Graphs