Chemicals, diseases, and their relations play central roles in many areas of biomedical research and healthcare, such as drug discovery and safety surveillance. Although the ultimate goal in drug discovery is to develop chemicals for therapeutics, recognition of adverse drug reactions between chemicals and diseases is important for improving chemical safety and toxicity studies and for facilitating new screening assays for pharmaceutical compound survival. In addition, the identification of chemicals as biomarkers can help inform potential relationships between chemicals and pathologies.
Token Classification and Relation Detection for Bio Articles: work with token classification on biological articles, then use the resulting model in another task — relation extraction between named entities. The entity types are `Chemical` and `Disease`.
The first step was data pre-processing and extraction of the features needed for the task. The Python scripts `parser.py`, `to_iob_converter.py`, and `cid_data_extractor.py` were written for that purpose. The data can be found in the corresponding folder of the repository.
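The conversion to IOB tags can be sketched as follows. This is a minimal illustration of the tagging scheme, not the actual `to_iob_converter.py` implementation; the example sentence and spans are made up:

```python
def to_iob(tokens, entities):
    """Tag tokens with B-/I-/O labels.

    entities: list of (start, end, label) token spans, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags

tokens = ["Naloxone", "reverses", "the", "hypotensive", "effect"]
print(to_iob(tokens, [(0, 1, "Chemical"), (3, 4, "Disease")]))
# → ['B-Chemical', 'O', 'O', 'B-Disease', 'O']
```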
Results of finetuning BERT, SciBERT, and BioBERT:
The precision, recall, and F1 scores shown in the table below are the macro average (arithmetic mean) of those metrics over 5 classes: B-Chemical, I-Chemical, B-Disease, I-Disease, and O. A scikit-learn wrapper was used for the finetuning task. The code for this part can be found in the `ModelsForNERComparison.ipynb` notebook.
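To make the table's metrics concrete, here is a pure-Python sketch of how a macro average is computed: per-class precision, recall, and F1, then their unweighted mean. The toy tag sequences below are illustrative, not from the dataset:

```python
from collections import defaultdict

def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over all observed classes."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    classes = sorted(set(y_true) | set(y_pred))
    prec, rec, f1 = [], [], []
    for c in classes:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        prec.append(p)
        rec.append(r)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)
    n = len(classes)
    return sum(prec) / n, sum(rec) / n, sum(f1) / n

y_true = ["B-Chemical", "O", "B-Disease", "I-Disease", "O", "B-Chemical"]
y_pred = ["B-Chemical", "O", "B-Disease", "O", "O", "B-Chemical"]
print(macro_scores(y_true, y_pred))
```

Note that the macro average weights every class equally, so rare classes (here, I-Disease) pull the score down as much as frequent ones — which is why macro F1 in the table sits well below token accuracy.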
| Model | Precision | Recall | F1 Score | Accuracy | Model Description |
|---|---|---|---|---|---|
| BERT `bert-base-cased` | 0.83 | 0.70 | 0.76 | 0.95 | HuggingFace, GitHub, Paper |
| SciBERT `scibert-scivocab-cased` | 0.86 | 0.77 | 0.81 | 0.96 | HuggingFace, GitHub, Paper |
| BioBERT `biobert-v1.1-pubmed-base-cased` | 0.86 | 0.72 | 0.78 | 0.95 | GitHub, Paper |
SciBERT showed the best performance on the given data, so it was chosen for further improvements and visualization of results, which can be found in the `Scibert_TokenClassification.ipynb` notebook. The SciBERT model was also finetuned with spaCy pipelines in the `Finetuning_SciBERT_with_SpaCy_Pipeline.ipynb` notebook for more convenient further usage.
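Entity visualization uses `displacy`. A minimal sketch of rendering NER output with displacy's manual mode, which takes precomputed spans instead of a loaded model — the text and character offsets below are illustrative, not taken from the notebook:

```python
from spacy import displacy

# Precomputed entity spans (character offsets into the text).
doc = {
    "text": "Naloxone reverses the hypotensive effect of clonidine.",
    "ents": [
        {"start": 0, "end": 8, "label": "Chemical"},
        {"start": 22, "end": 33, "label": "Disease"},
        {"start": 44, "end": 53, "label": "Chemical"},
    ],
    "title": None,
}

# manual=True renders the dict directly, without running a spaCy pipeline.
html = displacy.render(doc, style="ent", manual=True)
```

In a notebook, passing `jupyter=True` displays the highlighted entities inline instead of returning the HTML string.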
The final approach with Knowledge Graphs can be found in the `RD_KG_solution.ipynb` notebook. Experiments with Knowledge Graphs on the given dataset can be found in the `KnowledgeGraphs.ipynb` notebook. The core idea was to analyze dependencies between words in sentences, extract subjects, objects, and relations, and then use the trained NER model to filter Diseases and Chemicals from them. The resulting .tsv file containing the relations can be found by the link. Overall, this approach has some issues, such as the small number of entity1-relation-entity2 triples left after filtering. Here is the visualization of the resulting Knowledge Graph.
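Once the triples survive the NER filter, assembling the graph itself is straightforward. A minimal sketch using a plain adjacency dict — the triples below are illustrative examples, not rows from the actual .tsv file:

```python
# (entity1, relation, entity2) triples after filtering for Chemical/Disease
# entities; these examples are made up for illustration.
triples = [
    ("clonidine", "induces", "hypotension"),
    ("naloxone", "reverses", "hypotension"),
    ("lidocaine", "causes", "seizures"),
]

# Directed adjacency: head entity -> list of (relation, tail entity) edges.
graph = {}
for head, relation, tail in triples:
    graph.setdefault(head, []).append((relation, tail))

for head, edges in graph.items():
    for relation, tail in edges:
        print(f"{head} --{relation}--> {tail}")
```

The notebooks visualize the graph rather than print it; with few triples left after filtering, the resulting graph is sparse, which is the main limitation noted above.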
- `parser.py` - parsing .txt files, feature engineering, converting to .csv
- `to_iob_converter.py` - converter to the IOB (inside-outside-beginning) format, a common tagging format for tokens
- `cid_data_extractor.py` - extracting related named-entity pairs from the DNER and CID parts of the datasets
- `ModelsForNERComparison.ipynb` (nbviewer) - finetuning `bert-base-cased`, `scibert-scivocab-cased`, and `biobert-v1.1-pubmed-base-cased` on the dataset in IOB format and comparing the results
- `Scibert_TokenClassification.ipynb` (nbviewer) - further work with `scibert-scivocab-cased`, as it showed the best performance among the models; developing functions for extracting entities from a user's text and visualizing the results with `displacy`
- `Finetuning_SciBERT_with_SpaCy_Pipeline.ipynb` - using the spaCy 3 library to finetune SciBERT for the NER task with a spaCy pipeline
- `KnowledgeGraphs.ipynb` (nbviewer) - trying out relation extraction methods without the use of NER entities; developing functions for building Knowledge Graphs and visualizing the results
- `RD_KG_solution.ipynb` - final approach with Knowledge Graphs