BIOSSES is short for Biomedical Semantic Similarity Estimation System, a series of methods for assessing similarity between biomedical sentences proposed by Soğancıoğlu et al. (2017). Each method in the original BIOSSES produces its own similarity scores, which are then benchmarked using the Pearson correlation metric.
The benchmark dataset is a table of 100 biomedical sentence pairs picked from the TAC 2014 Biomedical Summarization Track Dataset. Each sentence pair has been assigned an integer similarity score by each of 5 human expert annotators; these scores are provided in a second table that accompanies the sentence table.
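As a rough illustration of the benchmarking step, the sketch below computes a Pearson correlation between human scores and a method's scores. It assumes the five annotator scores for each pair are averaged into a single gold score, which this README does not state explicitly. All numbers are invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented annotator scores: one row per sentence pair, one column per annotator.
annotator_scores = [
    [4, 4, 3, 4, 4],
    [0, 1, 0, 0, 1],
    [2, 3, 2, 2, 3],
    [1, 1, 2, 1, 1],
]
# Invented similarity scores from some method, for the same four pairs.
method_scores = [0.82, 0.10, 0.55, 0.30]

# Assumption: the gold score for a pair is the mean of its annotator scores.
gold = [sum(row) / len(row) for row in annotator_scores]
r = pearson(gold, method_scores)
print(f"Pearson r = {r:.3f}")
```

Pearson correlation is invariant to linear rescaling, so the method's scores do not need to be on the same scale as the annotator scores.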
biosses_d2v implements the paragraph vector approach (Le & Mikolov, 2014) to benchmarking BIOSSES sentences.
biosses_d2v uses the Doc2Vec model library from Gensim as it is the only popular open-source implementation of the paragraph vector model in Python as of now (documentation).
biosses_d2v also prepares the training corpus for Doc2Vec, which is the PubMed Central Open Access (PMCOA) Subset of biomedical papers, on part of which the original BIOSSES paragraph vectors were trained. Corpus text can be downloaded from an FTP server at different levels of granularity (see this); biosses_d2v enforces bulk downloads (i.e. not by individual papers) of the Commercial and Non-Commercial packages.
biosses_d2v treats each paper from the PMCOA Subset as a document with a vector to train.
Finally, biosses_d2v.py can be used from the command line to run both the training and the benchmarking with a single command.
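The idea of treating each paper as one document can be sketched as a restartable iterable that yields one tagged document per file. This is a minimal sketch, not the code in biosses_d2v: the class name PaperCorpus is hypothetical, tokenization is plain whitespace splitting, and a named tuple with the same two fields as Gensim's TaggedDocument (words, tags) stands in for it so the example runs without Gensim installed.

```python
import tempfile
from collections import namedtuple
from pathlib import Path

# Stand-in with the same fields as gensim's TaggedDocument (words, tags),
# used here only so the sketch has no third-party dependency.
TaggedDocument = namedtuple("TaggedDocument", "words tags")


class PaperCorpus:
    """Hypothetical restartable iterable: one tagged document per paper file."""

    def __init__(self, paths):
        self.paths = [Path(p) for p in paths]

    def __iter__(self):
        # A fresh generator is created on every call, so the corpus can be
        # iterated several times (vocabulary pass, then each training epoch).
        for index, path in enumerate(self.paths):
            text = path.read_text(encoding="utf-8", errors="ignore")
            # Whitespace tokenization only; real preprocessing would go here.
            yield TaggedDocument(words=text.lower().split(), tags=[index])


# Demo with two throwaway files standing in for downloaded papers.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, text in enumerate(["Protein A binds receptor B.", "Gene C is overexpressed."]):
        paper = Path(tmp) / f"paper_{i}.txt"
        paper.write_text(text, encoding="utf-8")
        paths.append(paper)

    corpus = PaperCorpus(paths)
    first_pass = list(corpus)
    second_pass = list(corpus)  # works again, unlike a one-shot generator

print(first_pass[0])
```

A class with `__iter__` is used instead of a bare generator because Doc2Vec training makes more than one pass over the data, and a generator would be exhausted after the first.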
Downloads a corpus that can be either part or all of the PubMed Central Open Access Subset.
Corpus bulk directories to be downloaded are specified by the packages parameter. Since they are named after the alphanumeric grouping of journal titles they contain, an iterable of any combination of the following groupings is valid: 0-9A-B (default), C-H, I-N, O-Z.
Customizes characteristics of the corpus.
These include the exact number of papers to load into the resulting corpus; whether the text is lemmatized; whether the corpus is streamed as an iterator or held as a list in memory; which stopwords to remove; and a regex pattern whose matches are removed.
Stores paths to downloaded papers internally.
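The stopword and regex options described above could work roughly as follows. This is a sketch under assumptions: the function name preprocess, its parameters, and the bracketed-citation pattern are all hypothetical and are not taken from biosses_d2v. Lemmatization is left out because it needs an external model.

```python
import re


def preprocess(text, stopwords=frozenset(), pattern=None):
    """Hypothetical helper: strip regex matches, tokenize, drop stopwords."""
    if pattern is not None:
        # Replace matches with a space so neighbouring words do not fuse.
        text = re.sub(pattern, " ", text)
    # Lowercase alphanumeric tokens, keeping internal hyphens (e.g. "il-6").
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [token for token in tokens if token not in stopwords]


raw = "The IL-6 receptor is upregulated [12, 13] in the tumor cells."
tokens = preprocess(
    raw,
    stopwords={"the", "is", "in"},
    # Illustrative pattern for numeric bracketed citations such as [12, 13].
    pattern=r"\[\d+(?:\s*,\s*\d+)*\]",
)
print(tokens)  # ['il-6', 'receptor', 'upregulated', 'tumor', 'cells']
```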
Enables downloading the biomedical sentence pair and annotator score tables and converting them into 2 separate DataFrames via get_sentence_df and get_score_df.
Benchmarks a Doc2Vec model via benchmark_with_d2v with the Pearson correlation metric.
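Before a correlation can be computed, each sentence pair needs a single similarity score derived from its two vectors. This README does not say how benchmark_with_d2v does that; the sketch below assumes cosine similarity, a common choice for paragraph vectors. The vectors are invented; with Gensim they would typically come from a trained model's infer_vector method.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Invented vectors standing in for the two sentences of one BIOSSES pair.
sentence_1_vector = [0.2, -0.1, 0.7]
sentence_2_vector = [0.25, -0.05, 0.6]

score = cosine_similarity(sentence_1_vector, sentence_2_vector)
print(f"similarity = {score:.3f}")
```

One such score per sentence pair gives the list of method scores that is then correlated with the annotator scores.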
An abstraction to streamline training Doc2Vec on a particular corpus.
**kwargs refers to any parameters passed into the instantiation of Doc2Vec here, except documents and corpus_file.
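One way such a wrapper could keep documents and corpus_file out of **kwargs is sketched below. The helper name build_doc2vec_kwargs is hypothetical, and raising an error (rather than silently dropping the argument) is an assumption about the desired behavior, not a description of what biosses_d2v does.

```python
# Parameters the wrapper is assumed to supply itself from the corpus.
RESERVED = ("documents", "corpus_file")


def build_doc2vec_kwargs(**kwargs):
    """Hypothetical helper: reject reserved names, pass everything else through."""
    clashes = [name for name in RESERVED if name in kwargs]
    if clashes:
        raise TypeError(
            "these parameters are set by the wrapper and cannot be passed: "
            + ", ".join(clashes)
        )
    return dict(kwargs)


# vector_size, epochs and min_count are ordinary Doc2Vec constructor parameters.
params = build_doc2vec_kwargs(vector_size=100, epochs=1, min_count=2)
print(params)
```

The returned dictionary would then be unpacked into the Doc2Vec constructor alongside the corpus the wrapper provides.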
It is recommended that a virtual environment be created and activated before installing any required libraries:
python3 -m venv venv
source venv/bin/activate
Then install the requirements as follows:
pip install -r requirements.txt
Commands used:
python biosses_d2v.py --use-logger --save-model -s 100 --e 1 --iterator
Train Doc2Vec on a corpus of 100 non-lemmatized PMCOA papers over 1 epoch, streaming the texts in one by one, logging training progress and saving the model.
python biosses_d2v.py --use-logger --save-model -s 100 --e 1 --lemma
Train Doc2Vec on a corpus of 100 lemmatized PMCOA papers over 1 epoch, storing the texts as a list in memory, logging training progress and saving the model.
- Optimized parameters for the default CLI command.
- Data structures to store training corpus.
- Text preprocessing.
Lemmatization engine (currently using scispacy); stopword choices; regex patterns to remove unwanted features (could try r"(\s+((([À-ÿA-Za-z\s-.,;&])+\s((\d{4}[a-z])+))+)|(\s+[[\d\s+,;&[]]+])" for getting rid of the majority of in-text citations), etc.
- Should a TaggedDocument be a unit of text other than an entire paper, as it is now? Maybe a single paragraph? Or a sentence?
- Other metrics to benchmark by.
- Other corpora to train Doc2Vec on.
- Other Doc2Vec implementations.