Improving Clinical NLP Performance with Synthetic Clinical Data

Overview

This repository hosts the code and resources for the research project "Improving Clinical NLP Performance through Language Model-Generated Synthetic Clinical Data," which aims to improve clinical natural language processing (NLP) methods using synthetic data generated by large language models. Our study demonstrates that augmenting NLP model training with high-quality synthetic clinical text is both feasible and effective, with promising applications in the high-stakes domain of healthcare.

Key Features

  • Synthetic Data Generation: Utilizes large language models (LLMs) to create synthetic annotated clinical text datasets.
  • Label Correction Technique: Applies an active learning step to improve the quality of the synthetic datasets.
  • Benchmark Evaluation: Assesses model performance on NLP benchmarks and real-world long-document clinical datasets.
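
To make the first two features concrete, here is a minimal Python sketch. It is not the code or the prompts from this repository: the `SyntheticExample` class, the `build_prompt` template, the `select_for_review` helper, and the 0.6 threshold are all hypothetical names and values chosen for illustration. It assumes that label correction can be approximated by flagging the synthetic examples whose labels a classifier is least confident about, so that they can be reviewed first.

```python
from dataclasses import dataclass


@dataclass
class SyntheticExample:
    text: str          # synthetic clinical sentence returned by an LLM
    label: str         # label the LLM was asked to express
    confidence: float  # assumed: a classifier's probability for that label


def build_prompt(task: str, label: str, n_examples: int) -> str:
    """Format a generation request for one label (hypothetical template)."""
    return (
        f"Task: {task}\n"
        f"Write {n_examples} short, de-identified clinical note sentences "
        f"whose correct label is '{label}'.\n"
        "Return one sentence per line."
    )


def select_for_review(examples, threshold=0.6):
    """Return examples below the confidence threshold, least confident first."""
    flagged = [ex for ex in examples if ex.confidence < threshold]
    return sorted(flagged, key=lambda ex: ex.confidence)
```

In this sketch, the string from `build_prompt` would be sent to whichever LLM is in use, and the examples returned by `select_for_review` would be the ones passed to a reviewer for relabeling. The actual prompts for each task are in the repository's .py files.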

In this repo

  • The four .py files contain the detailed prompts used for synthetic data generation, covering all of our tasks.

Contact

For any queries, please reach out to Dr. Danielle S. Bitterman at [email protected].

Citation

If you use our work in your research, please cite:

@article{chen2024improving,
  title={Improving Clinical NLP Performance through Language Model-Generated Synthetic Clinical Data},
  author={Chen, Shan and Gallifant, Jack and Guevara, Marco and Gao, Yanjun and others},
  journal={arXiv preprint arXiv:XXXX.XXXX},
  year={2024}
}