Teasotea/BioNER-and-RD

Token Classification and Relation Detection

Chemicals, diseases, and their relations play central roles in many areas of biomedical research and healthcare such as drug discovery and safety surveillance. Although the ultimate goal in drug discovery is to develop chemicals for therapeutics, recognition of adverse drug reactions between chemicals and diseases is important for improving chemical safety and toxicity studies and facilitating new screening assays for pharmaceutical compound survival. In addition, identification of chemicals as biomarkers can be helpful in informing potential relationships between chemicals and pathologies. More info about the task.

To Do

Token classification and relation detection for biological articles. The work first trains a token classification model on biological articles, then uses the resulting model in a second task: relation extraction between named entities. The entity types are Chemical and Disease.

Data Preparation

The first step was to pre-process the data and extract the features needed for training. The Python scripts parser.py, to_iob_converter.py, and cid_data_extractor.py were written for that purpose. The data can be found in folder
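To make the IOB conversion step concrete, here is a minimal sketch of how character-offset entity annotations can be turned into per-token IOB tags. It is not the code from to_iob_converter.py: the function names `whitespace_tokenize` and `to_iob`, the whitespace tokenizer, and the example sentence are all assumptions made for illustration.

```python
from typing import List, Tuple

Token = Tuple[str, int, int]    # (text, start offset, end offset exclusive)
Entity = Tuple[int, int, str]   # (start offset, end offset exclusive, label)


def whitespace_tokenize(text: str) -> List[Token]:
    """Split on whitespace and record the character offsets of each token."""
    tokens, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append((word, start, end))
        pos = end
    return tokens


def to_iob(tokens: List[Token], entities: List[Entity]) -> List[str]:
    """Tag each token as B-<label>, I-<label>, or O.

    Assumes entity spans do not overlap and tokens are in document order.
    """
    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        first = True
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            # A token belongs to the entity when it lies fully inside the span.
            if tok_start >= ent_start and tok_end <= ent_end:
                tags[i] = ("B-" if first else "I-") + label
                first = False
    return tags


# Hypothetical example sentence with hand-written annotations.
text = "Aspirin may cause gastric ulcer"
entities = [(0, 7, "Chemical"), (18, 31, "Disease")]
tokens = whitespace_tokenize(text)
tags = to_iob(tokens, entities)
for (word, _, _), tag in zip(tokens, tags):
    print(f"{word}\t{tag}")
```

The five tags this scheme produces for two entity types (B-Chemical, I-Chemical, B-Disease, I-Disease, O) are the classes the token classifier is trained on.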

Token Classification

Extracting Entities Example

The table below reports the results of fine-tuning BERT, SciBERT, and BioBERT. Precision, recall, and F1 score are macro averages (arithmetic means) of those metrics over five classes: B-Chemical, I-Chemical, B-Disease, I-Disease, and O. A scikit-learn wrapper was used for fine-tuning. The code for this part can be found in the ModelsForNERComparison.ipynb notebook.

| Model | Precision | Recall | F1 Score | Accuracy | Model Description |
|---|---|---|---|---|---|
| BERT (bert-base-cased) | 0.83 | 0.70 | 0.76 | 0.95 | HuggingFace, GitHub, Paper |
| SciBERT (scibert-scivocab-cased) | 0.86 | 0.77 | 0.81 | 0.96 | HuggingFace, GitHub, Paper |
| BioBERT (biobert-v1.1-pubmed-base-cased) | 0.86 | 0.72 | 0.78 | 0.95 | GitHub, Paper |
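As a reminder of what a macro average means for these five classes, the sketch below computes per-class precision, recall, and F1 and then takes their unweighted mean. It is a plain-Python illustration, not the notebook's code; the function name `macro_scores`, the toy label sequences, and the convention of scoring an undefined ratio as 0.0 are assumptions.

```python
from typing import Dict, List

LABELS = ["B-Chemical", "I-Chemical", "B-Disease", "I-Disease", "O"]


def macro_scores(y_true: List[str], y_pred: List[str],
                 labels: List[str] = LABELS) -> Dict[str, float]:
    """Unweighted mean of per-class precision, recall, and F1, plus accuracy."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Assumption: an undefined ratio (zero denominator) counts as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,  # mean of per-class F1, not F1 of the mean P and R
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
    }


# Toy sequences for illustration only; one I-Chemical token is missed.
y_true = ["B-Chemical", "I-Chemical", "O", "B-Disease", "I-Disease", "O"]
y_pred = ["B-Chemical", "O", "O", "B-Disease", "I-Disease", "O"]
scores = macro_scores(y_true, y_pred)
print(scores)
```

Because every class carries equal weight, a single missed I-Chemical token lowers the macro scores far more than it lowers accuracy, which is dominated by the frequent O class. This is one plausible reason the accuracy column in the table is much higher than the macro recall.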

SciBERT showed the best performance on the given data, so it was chosen for further improvement and for visualization of results, which can be found in the Scibert_TokenClassification.ipynb notebook. The SciBERT model was also fine-tuned with spaCy pipelines in the Finetuning_SciBERT_with_SpaCy_Pipeline.ipynb notebook to make later use more convenient.

Knowledge Graphs

The final approach with Knowledge Graphs can be found in the RD_KG_solution.ipynb notebook. Experiments with KG on the given dataset can be found in the KnowledgeGraphs.ipynb notebook. The core idea was to analyze dependencies between words in sentences, extract objects, subjects, and relations, and then use the trained NER model to keep only those that are Diseases or Chemicals. The resulting .tsv file, which contains the relations, can be found by link. This approach has some issues, chiefly that only a small number of entity1-relation-entity2 triples remain after filtering. Here is the visualization of the resulting Knowledge Graph.
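The filtering step described above can be sketched as follows. This is not the code from RD_KG_solution.ipynb: the function names, the `entity_types` lookup (standing in for the NER model's predictions), the rule that a triple must link one Chemical and one Disease, and the example triples are all assumptions for illustration.

```python
import csv
import io
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (entity1, relation, entity2)


def filter_chemical_disease(triples: List[Triple],
                            entity_types: Dict[str, str]) -> List[Triple]:
    """Keep triples that connect exactly one Chemical and one Disease.

    entity_types maps a surface string to the label assigned by the NER
    model; strings the model did not tag are simply absent from the dict.
    """
    kept = []
    for subj, rel, obj in triples:
        types = {entity_types.get(subj), entity_types.get(obj)}
        if types == {"Chemical", "Disease"}:
            kept.append((subj, rel, obj))
    return kept


def to_tsv(triples: List[Triple]) -> str:
    """Serialize triples as tab-separated text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["entity1", "relation", "entity2"])
    writer.writerows(triples)
    return buf.getvalue()


# Hypothetical triples, as they might come out of a dependency parse.
triples = [
    ("aspirin", "causes", "gastric ulcer"),
    ("patients", "received", "aspirin"),
    ("aspirin", "interacts with", "warfarin"),
]
entity_types = {
    "aspirin": "Chemical",
    "gastric ulcer": "Disease",
    "warfarin": "Chemical",
}
kept = filter_chemical_disease(triples, entity_types)
print(to_tsv(kept))
```

In this toy run only one of three triples survives, which mirrors the issue noted above: requiring both ends of a relation to be recognized entities discards most extracted triples.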

Details

Picture 1: Label prediction for entities of three models on Test Set

Comparison of label prediction for entities of three models with the right labels

Picture 2: Fine-Tuned SciBERT Metrics

Fine-Tuned SciBERT Performance

Files & Notebooks
