Improving Clinical NLP Performance with Synthetic Clinical Data

Overview

This repository hosts the code and resources for the research project "Improving Clinical NLP Performance through Language Model-Generated Synthetic Clinical Data." The project aims to enhance clinical natural language processing (NLP) methods through the use of synthetic data generated by advanced language models. Our study demonstrates that augmenting NLP model training with high-quality synthetic clinical text is feasible and effective, with promising applications in the high-stakes domain of healthcare.

Key Features

Synthetic Data Generation: Uses large language models (LLMs) to create synthetic annotated clinical text datasets.
Label Correction Technique: Applies an active learning step to improve the quality of the synthetic datasets.
Benchmark Evaluation: Assesses model performance on NLP benchmarks and real-world long-document clinical datasets.
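
To make the label correction idea concrete, here is a minimal sketch of one common active learning selection rule: rank synthetic examples by how little a classifier trusts their assigned label, and send the least trusted ones for review. This is an illustration under assumptions, not the procedure used in the paper. The `SyntheticExample` class, the `select_for_review` function, the `budget` parameter, the label names, and the toy sentences are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SyntheticExample:
    text: str
    label: str                # label assigned when the text was generated
    probs: dict[str, float]   # hypothetical classifier probability per label


def select_for_review(examples, budget):
    """Return up to `budget` examples whose assigned label the classifier trusts least."""
    # A lower probability on the assigned label means the label is more suspect.
    ranked = sorted(examples, key=lambda ex: ex.probs.get(ex.label, 0.0))
    return ranked[:budget]


# Toy data for illustration only; not drawn from any real dataset.
examples = [
    SyntheticExample("Patient reports stable housing.", "no_issue",
                     {"no_issue": 0.95, "issue": 0.05}),
    SyntheticExample("Patient lives alone and lacks transport.", "no_issue",
                     {"no_issue": 0.20, "issue": 0.80}),
    SyntheticExample("Patient recently lost their job.", "issue",
                     {"no_issue": 0.40, "issue": 0.60}),
]

for ex in select_for_review(examples, budget=2):
    print(ex.label, "|", ex.text)
```

In this sketch the second example is ranked first for review, because the classifier assigns only 0.20 probability to its generated label. Reviewed labels could then be corrected and the classifier retrained, repeating until the review budget is spent.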

In this repo

The four .py files in this repository contain the detailed prompts used for synthetic data generation for each of our tasks.

Contact

For any queries, please reach out to Dr. Danielle S. Bitterman at dbitterman@bwh.harvard.edu.

Citation

If you use our work in your research, please cite:

@article{chen2024improving,
  title={Improving Clinical NLP Performance through Language Model-Generated Synthetic Clinical Data},
  author={Chen, Shan and Gallifant, Jack and Guevara, Marco and Gao, Yanjun and others},
  journal={arXiv preprint arXiv:XXXX.XXXX},
  year={2024}
}
