johndef64/grpm_bertopic

GRPM BERTopic Analysis

This repository contains the Jupyter notebook grpm_bertopic.ipynb, which implements a BERTopic-based topic modeling pipeline to uncover hidden topics in the PubMed genetic literature.

Repository Structure

  • data: Holds the data produced during notebook execution.
  • utlis: Contains accessory Python code.
  • grpm_bertopic.ipynb: The Jupyter notebook containing all the steps, code, and detailed information about the GRPM BERTopic analysis process.
  • bertopic_tutorial.ipynb: Jupyter notebook created for educational purposes.

Requirements

All required libraries and the specific versions used for this project are listed within the grpm_bertopic.ipynb notebook. Make sure to install these dependencies before going through the notebook.
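
As a quick sanity check before opening the notebook, the sketch below reports which packages are installed and at what version, using only the standard library. The package names in `PACKAGES` are an assumption made for illustration; the list inside grpm_bertopic.ipynb remains the authoritative one.

```python
from importlib import metadata

# Hypothetical package list for illustration only; replace it with the
# libraries and versions listed in grpm_bertopic.ipynb.
PACKAGES = ["bertopic", "pandas", "numpy"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

Any package reported as "not installed" can then be installed with pip at the version the notebook specifies.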

Usage

To perform the GRPM BERTopic analysis, follow the steps laid out in the grpm_bertopic.ipynb notebook. By following these steps, you will be able to explore the connections between genetic variations and the MeSH terms you provide.

The general workflow is depicted in the figure below.

[Figure: Workflow]

About GRPM BERTopic Analysis

The GRPM BERTopic Analysis uses a structured approach to extract and examine themes from a collection of scholarly abstracts related to human genetic polymorphisms. The key steps of the analysis are outlined below:

  1. Dataset Acquisition: The analysis starts by retrieving a dataset of scholarly abstracts focusing on human genetic polymorphisms. This dataset, termed the GRPM Dataset, integrates data from sources like LitVar and PubMed. You can access the dataset via the DOI link: DOI.

  2. Data Preprocessing: The preprocessing phase uses a user-defined set of Medical Subject Headings (MeSH) terms to curate the corpus of abstracts. An example set of MeSH terms is available in the data/ref-mesh.csv file. This step is crucial because it refines the corpus of abstracts, preparing it for effective topic modeling.

  3. Topic Modeling with BERTopic: The refined corpus undergoes topic modeling using the BERTopic architecture. This framework employs hierarchical clustering techniques to uncover the latent thematic structure of the abstracts, providing an overview of the topics present in the corpus.

  4. Data Post-processing: Finally, selected topics undergo post-processing, highlighting specific themes for in-depth exploration. This stage enhances the understanding of genetic influences pertinent to biomedical fields, as specified by the custom MeSH terms.
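
The preprocessing idea in step 2 can be illustrated with a minimal sketch: keep only the abstracts annotated with at least one MeSH term from the reference set. The record layout (`pmid`, `abstract`, `mesh`), the function name `filter_by_mesh`, and the toy records are all assumptions made for this example; the notebook may structure its data and its filtering differently.

```python
def filter_by_mesh(records, reference_terms):
    """Keep records annotated with at least one reference MeSH term.

    Matching is case-insensitive and ignores surrounding whitespace.
    """
    wanted = {term.strip().lower() for term in reference_terms}
    kept = []
    for record in records:
        annotated = {term.strip().lower() for term in record["mesh"]}
        if annotated & wanted:  # non-empty intersection means a match
            kept.append(record)
    return kept


# Toy records, invented for illustration (not taken from the GRPM Dataset).
records = [
    {"pmid": "1", "abstract": "A variant and obesity risk.", "mesh": ["Obesity", "Humans"]},
    {"pmid": "2", "abstract": "A polymorphism in crop plants.", "mesh": ["Plants"]},
    {"pmid": "3", "abstract": "Diet, genotype and body weight.", "mesh": ["Diet", "Body Weight"]},
]
reference_terms = ["obesity", "Diet"]

corpus = filter_by_mesh(records, reference_terms)
print([record["pmid"] for record in corpus])  # prints ['1', '3']
```

In practice the reference terms would be read from a file such as data/ref-mesh.csv rather than hard-coded.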

This pipeline is designed to identify significant patterns and relationships within a set of abstracts, offering insights into potential genetic contributions within the biomedical domain.
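
As a rough illustration of the topic modeling step in this pipeline, the sketch below fits a BERTopic model on a list of abstracts and summarizes how many documents fall into each topic. It uses BERTopic's public `fit_transform` call with default components; the helper names `fit_topics` and `topic_sizes` and the `min_topic_size` value are assumptions, and the notebook's actual configuration may differ.

```python
from collections import Counter


def fit_topics(docs, min_topic_size=10):
    """Fit BERTopic on a list of abstracts.

    Returns the fitted model and one topic id per document.
    Requires the third-party `bertopic` package, imported lazily here.
    """
    from bertopic import BERTopic

    model = BERTopic(min_topic_size=min_topic_size)
    topics, _probabilities = model.fit_transform(docs)
    return model, topics


def topic_sizes(topics):
    """Count documents per topic, largest first.

    BERTopic assigns the id -1 to outlier documents; those are skipped.
    """
    counts = Counter(topic for topic in topics if topic != -1)
    return counts.most_common()
```

With a corpus of a few hundred abstracts or more, calling `model, topics = fit_topics(abstracts)` followed by `topic_sizes(topics)` gives a first view of the topic distribution, and `model.get_topic_info()` returns BERTopic's own summary table. Very small corpora may not cluster meaningfully with the default settings.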

If you encounter any issues or have any questions, feel free to open an issue in this repository.