Jinlong Pang, Jiaheng Wei, Ankit Parag Shah, Zhaowei Zhu, Yaxuan Wang, Chen Qian, Yang Liu, Yujia Bao and Wei Wei.
REAL Lab, University of California, Santa Cruz
- [2025.01.22] 👏👏 Accepted at ICLR 2025.
- [2024.11.10] 📢📢 Released the curated dataset on Hugging Face.
- [2024.10.08] 🚀🚀 Released the code of DS2.
This project is motivated by the frequent and widespread errors in LLM-generated raw rating scores, which can vary significantly across different models. These score errors can be visualized with a score transition matrix: larger values on the matrix's diagonal indicate smaller score errors.
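As an illustration, such a matrix can be estimated by counting how often one set of ratings maps to another for the same samples. The sketch below uses made-up reference scores for clarity; the actual curation pipeline estimates the transition matrix without ground-truth labels.

```python
import numpy as np

def score_transition_matrix(ref_scores, raw_scores, num_classes=6):
    """Row-normalized matrix T, where T[i, j] estimates
    P(raw score = j | reference score = i) for scores 0..num_classes-1."""
    T = np.zeros((num_classes, num_classes))
    for i, j in zip(ref_scores, raw_scores):
        T[i, j] += 1
    row_sums = T.sum(axis=1, keepdims=True)
    # Avoid division by zero for score values that never occur.
    return np.divide(T, row_sums, out=np.zeros_like(T), where=row_sums > 0)

# Toy example: the raw LLM scores mostly agree with the reference scores,
# so most of the probability mass sits on the diagonal.
ref = [0, 1, 2, 2, 3, 3, 4, 5]
raw = [0, 1, 2, 3, 3, 3, 4, 5]
T = score_transition_matrix(ref, raw)
print(T.round(2))
```

A near-identity matrix means the rater's scores are reliable; heavy off-diagonal mass (as in row 2 above, where half the mass leaks to score 3) signals systematic rating errors.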
In response, we introduce DS2, a diversity-aware score curation approach to enhance data selection.
- Prompt-based LLM Rating: We generate an initial quality score for each data sample using advanced LLMs.
- Curated Quality Score Generation: This step corrects potential rating score errors from the previous step by leveraging the Score Transition Matrix to derive a curated quality score.
- Long-tail Diversity Score Generation: We score the diversity of each example by measuring the distance between feature embeddings, identifying samples that fall outside common clusters, which tend to be more distinct.
- Final Data Selection: We prioritize data by first sorting on the curated scores and then on the long-tail scores. This dual-sorting strategy helps remove poor-quality outliers while ensuring a diverse, high-quality dataset.
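The last two steps can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: the long-tail score here is a simple mean distance to the k nearest neighbors in embedding space, and the curated scores are made up.

```python
import numpy as np

def longtail_scores(embeddings, k=2):
    """Long-tail diversity proxy: mean distance to the k nearest neighbors.
    Samples far from common clusters receive higher scores.
    (Illustrative; the repo's exact distance/clustering may differ.)"""
    dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # exclude self-distance
    nearest = np.sort(dists, axis=1)[:, :k]  # k smallest distances per sample
    return nearest.mean(axis=1)

# Toy embeddings: three samples in a tight cluster plus one outlier.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
lt = longtail_scores(emb)
curated = np.array([5, 5, 4, 5])  # hypothetical curated quality scores

# Dual sort: curated quality first, long-tail diversity as the tie-breaker.
order = sorted(range(len(emb)), key=lambda i: (curated[i], lt[i]), reverse=True)
print(order)  # → [3, 1, 0, 2]: the outlier (sample 3) ranks first among top-quality samples
```

Among samples with equal curated scores, the long-tail score pushes rare, distinctive samples ahead of near-duplicates inside dense clusters.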
One can download the evaluation and training data by running:

```bash
# eval data
bash model_finetune/prepare_eval_data.sh
# train data
bash model_finetune/prepare_train_data.sh
```
In this project, we use three labeling models to generate rating scores: GPT-4o-mini, Mistral-7B-Instruct-v0.3, and LLaMA-3.1-8B-Instruct. One can obtain the LLM-generated rating scores by running:

```bash
# Open-source LLMs
cd LLM_scoring && bash scoring.sh
# API call
cd LLM_scoring && bash scoring_api.sh
```
One can execute the score curation by running:

```bash
cd score_curation && bash diagnose.sh
```

The corresponding curation report files can be found under `score_curation_results/`.
Given the generated score curation reports, one can directly generate the high-quality subset by running:

```bash
python subset_generation.py
```
The generated subsets, stored under the `selected_data` path, can be used for LLM instruction tuning. For easy reproduction, one can directly fine-tune the models by running (codebase: TULU):

```bash
cd model_finetune && bash run_pipeline.sh
```
If you find this repository useful, please cite our work:
```bibtex
@article{pang2024improving,
  title={Improving Data Efficiency via Curating LLM-Driven Rating Systems},
  author={Pang, Jinlong and Wei, Jiaheng and Shah, Ankit Parag and Zhu, Zhaowei and Wang, Yaxuan and Qian, Chen and Liu, Yang and Bao, Yujia and Wei, Wei},
  journal={International Conference on Learning Representations},
  year={2025}
}
```