GRPM BERTopic Analysis

This repository contains the Jupyter notebook grpm_bertopic.ipynb, which implements a BERTopic-based topic modeling pipeline to uncover hidden topics in the PubMed genetic literature.

Repository Structure

data: Stores the data produced during notebook execution.
utils: Contains accessory Python code.
grpm_bertopic.ipynb: The Jupyter notebook containing all the steps, code, and detailed information about the GRPM BERTopic analysis process.
bertopic_tutorial.ipynb: Jupyter notebook created for educational purposes.

Requirements

All required libraries and the specific versions used for this project are listed in the grpm_bertopic.ipynb notebook. Install these dependencies before running the notebook. The repository root also includes a requirements.txt file.
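
As a quick sanity check before opening the notebook, a minimal sketch like the following reports which packages are installed and at what version. The package names in ASSUMED_PACKAGES are assumptions for illustration only; the authoritative list is the one given in the notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Hypothetical package list; replace it with the one given in grpm_bertopic.ipynb.
ASSUMED_PACKAGES = ["bertopic", "pandas", "numpy"]


def report_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, installed in report_versions(ASSUMED_PACKAGES).items():
        print(f"{name}: {installed or 'NOT INSTALLED'}")
```

Comparing this output with the versions listed in the notebook shows at a glance which dependencies are missing or differ.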

Usage

To perform the GRPM BERTopic analysis, follow the steps laid out in the grpm_bertopic.ipynb notebook. These steps let you explore the connections between genetic variations and the MeSH terms you provide.

The general workflow is outlined in the section below.

About GRPM BERTopic Analysis

The GRPM BERTopic Analysis uses a structured approach to extract and examine themes from a collection of scholarly abstracts related to human genetic polymorphisms. The key steps of the analysis are outlined below:

Dataset Acquisition: The analysis starts by retrieving a dataset of scholarly abstracts focusing on human genetic polymorphisms. This dataset, termed the GRPM Dataset, integrates data from sources such as LitVar and PubMed. The dataset is available via its DOI link.
Data Preprocessing: The preprocessing phase uses a user-defined set of Medical Subject Headings (MeSH) terms to curate the corpus of abstracts. An example set of MeSH terms is available in the data/ref-mesh.csv file. This step refines the corpus so that topic modeling runs only on abstracts relevant to the chosen terms.
Topic Modeling with BERTopic: The refined corpus undergoes topic modeling with BERTopic. This framework clusters embeddings of the abstracts (by default with HDBSCAN, a hierarchical density-based algorithm) to uncover their latent thematic structure, giving an overview of the topics present in the corpus.
Data Post-processing: Finally, selected topics undergo post-processing, highlighting specific themes for in-depth exploration. This stage enhances the understanding of genetic influences pertinent to biomedical fields, as specified by the custom MeSH terms.
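
The preprocessing step above can be sketched as a simple filter: keep only the abstracts annotated with at least one reference MeSH term. This is a minimal illustration, not the notebook's actual code. The record fields (pmid, abstract, mesh), the semicolon-separated term layout, and the function name filter_by_mesh are all assumptions, and the records are toy data.

```python
def filter_by_mesh(records, reference_terms):
    """Keep records annotated with at least one reference MeSH term."""
    wanted = {term.strip().lower() for term in reference_terms}
    kept = []
    for record in records:
        # Assumed layout: MeSH terms stored as one semicolon-separated string.
        terms = {t.strip().lower() for t in record["mesh"].split(";") if t.strip()}
        if terms & wanted:
            kept.append(record)
    return kept


# Toy records with made-up identifiers and text, for illustration only.
records = [
    {"pmid": "1", "abstract": "Variant A and obesity risk.",
     "mesh": "Obesity; Polymorphism, Single Nucleotide"},
    {"pmid": "2", "abstract": "Variant B and bone density.",
     "mesh": "Bone Density"},
    {"pmid": "3", "abstract": "Variant C and dietary fat intake.",
     "mesh": "Dietary Fats; Genotype"},
]
reference_terms = ["Obesity", "Dietary Fats"]

corpus = filter_by_mesh(records, reference_terms)
print([r["pmid"] for r in corpus])  # ['1', '3']
```

A semicolon is used as the separator here because MeSH headings can themselves contain commas.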

This pipeline is designed to identify significant patterns and relationships within a set of abstracts, offering insights into potential genetic contributions within the biomedical domain.
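
The topic modeling and post-processing steps might look roughly like the sketch below. It assumes the bertopic package is installed and uses only its documented BERTopic class and fit_transform method. The helper names fit_topics and group_selected_topics, the min_topic_size value, and the toy inputs are hypothetical and not taken from the notebook.

```python
from collections import defaultdict


def fit_topics(docs, min_topic_size=10):
    """Fit BERTopic on a list of abstracts; return the model and one topic id per doc."""
    # Imported lazily so the helper below can be used without bertopic installed.
    from bertopic import BERTopic

    topic_model = BERTopic(min_topic_size=min_topic_size)
    topics, _probabilities = topic_model.fit_transform(docs)
    return topic_model, topics


def group_selected_topics(docs, topics, selected):
    """Group documents by topic id, keeping only the selected topics."""
    grouped = defaultdict(list)
    for doc, topic in zip(docs, topics):
        # BERTopic assigns -1 to outliers that fit no topic; these are skipped.
        if topic != -1 and topic in selected:
            grouped[topic].append(doc)
    return dict(grouped)


# Hand-written topic ids stand in for real model output in this example.
docs = ["abstract a", "abstract b", "abstract c", "abstract d"]
topics = [0, 1, -1, 0]
print(group_selected_topics(docs, topics, selected={0}))
# {0: ['abstract a', 'abstract d']}
```

With bertopic installed, calling fit_topics on the filtered abstracts and then topic_model.get_topic_info() lists the discovered topics. A real corpus needs far more documents than this toy list.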

If you encounter any issues or have any questions, feel free to open an issue in this repository.

Repository: johndef64/grpm_bertopic
