This repository contains the Jupyter notebook grpm_bertopic.ipynb
which implements a BERTopic-based topic modeling pipeline to uncover hidden topics in the PubMed genetic literature.
data
: Stores the data produced during notebook execution.
utlis
: Contains accessory Python code.
grpm_bertopic.ipynb
: The Jupyter notebook containing all the steps, code, and detailed information about the GRPM BERTopic analysis process.
bertopic_tutorial.ipynb
: Jupyter notebook created for educational purposes.
All required libraries and the specific versions used for this project are listed within the grpm_bertopic.ipynb
notebook. Make sure to install these dependencies before going through the notebook.
To perform the GRPM BERTopic analysis, follow the steps laid out in the grpm_bertopic.ipynb
notebook. Following these steps, you'll be able to unravel the connections between genetic variations and the MeSH terms provided.
The general workflow has been depicted below:
The GRPM BERTopic analysis uses a structured approach to extract and examine themes from a collection of scholarly abstracts related to human genetic polymorphisms. The key steps of the analysis are outlined below:
-
Dataset Acquisition: The analysis starts by retrieving a dataset of scholarly abstracts focusing on human genetic polymorphisms. This dataset, termed the GRPM Dataset, integrates data from sources like LitVar and PubMed. You can access the dataset via the DOI link: .
-
Data Preprocessing: The preprocessing phase uses a user-defined set of Medical Subject Headings (MeSH) terms to curate the corpus of abstracts. An example of MeSH terms is available in the data/ref-mesh.csv file. This step is crucial as it refines the corpus of abstracts, preparing it for effective topic modeling.
-
Topic Modeling with BERTopic: The refined corpus undergoes topic modeling using the BERTopic architecture. This framework employs advanced hierarchical clustering techniques to uncover the latent thematic structures of the abstracts, providing a comprehensive overview of the topic model's underlying architecture.
-
Data Post-processing: Finally, selected topics undergo post-processing, highlighting specific themes for in-depth exploration. This stage enhances the understanding of genetic influences pertinent to biomedical fields, as specified by the custom MeSH terms.
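The MeSH-based curation in the preprocessing step can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the `mesh` column name, the record fields (`pmid`, `abstract`, `mesh`), the helper names, and the toy data are all assumptions made for the example, and the real layout of data/ref-mesh.csv may differ.

```python
import csv
import io


def load_mesh_terms(handle, column="mesh"):
    """Read reference MeSH terms from a CSV file object.

    `column` is an assumed header name, not necessarily the one
    used in data/ref-mesh.csv.
    """
    reader = csv.DictReader(handle)
    terms = set()
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:
            terms.add(value.lower())
    return terms


def filter_abstracts(records, mesh_terms):
    """Keep records annotated with at least one reference MeSH term."""
    kept = []
    for record in records:
        record_terms = {term.strip().lower() for term in record["mesh"]}
        if record_terms & mesh_terms:
            kept.append(record)
    return kept


# Toy stand-in for a reference MeSH file (hypothetical content).
ref_csv = io.StringIO('mesh\nObesity\n"Diabetes Mellitus, Type 2"\n')

# Toy records (hypothetical identifiers and abstracts).
records = [
    {"pmid": "1", "abstract": "Variant A is associated with body mass index.",
     "mesh": ["Obesity", "Humans"]},
    {"pmid": "2", "abstract": "Variant B alters drug response.",
     "mesh": ["Pharmacogenetics"]},
    {"pmid": "3", "abstract": "Variant C and insulin resistance.",
     "mesh": ["Diabetes Mellitus, Type 2"]},
]

mesh_terms = load_mesh_terms(ref_csv)
corpus = filter_abstracts(records, mesh_terms)
print([record["pmid"] for record in corpus])
```

Matching is case-insensitive and exact on the full term, so a record is retained only when one of its MeSH annotations appears verbatim in the reference list.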
This pipeline is designed to identify significant patterns and relationships within a set of abstracts, offering insights into potential genetic contributions within the biomedical domain.
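As a rough sketch of the topic modeling and post-processing stages, the snippet below fits a BERTopic model on a list of abstracts and returns the topic overview table without the outlier topic. The helper names `fit_topics` and `summarize_topics` and the `min_topic_size` default are assumptions for illustration; the notebook may configure BERTopic differently.

```python
def fit_topics(docs, min_topic_size=10):
    """Fit a BERTopic model on a list of abstract strings.

    The default `min_topic_size` is an illustrative assumption.
    """
    # Imported lazily so the helpers can be defined without bertopic installed.
    from bertopic import BERTopic

    topic_model = BERTopic(language="english", min_topic_size=min_topic_size)
    topics, _probabilities = topic_model.fit_transform(docs)
    return topic_model, topics


def summarize_topics(topic_model):
    """Return the per-topic overview table, excluding outliers."""
    topic_info = topic_model.get_topic_info()
    # Topic -1 collects documents that were not assigned to any cluster.
    return topic_info[topic_info["Topic"] != -1]
```

Calling `fit_topics` on the curated abstracts and passing the resulting model to `summarize_topics` gives a table from which topics of interest can be picked for closer inspection. BERTopic needs a reasonably large corpus to form clusters, so a handful of documents will usually not yield meaningful topics.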
If you encounter any issues or have any questions, feel free to open an issue in this repository.