
DS2: Improving Data Efficiency via Curating LLM-Driven Rating Systems

Jinlong Pang, Jiaheng Wei, Ankit Parag Shah, Zhaowei Zhu, Yaxuan Wang, Chen Qian, Yang Liu, Yujia Bao and Wei Wei.

REAL Lab, University of California, Santa Cruz


🎉🎉 News

  • [2025.01.22] 👏👏 Accepted by ICLR 2025.
  • [2024.11.10] 📢📢 Released the curated dataset on Hugging Face.
  • [2024.10.08] 🚀🚀 Released the code of DS2.

Brief Introduction

This project is motivated by the frequent and widespread errors in LLM-generated raw rating scores, which can vary significantly across different labeling models. These score errors can be visualized with a score transition matrix: the larger the values on the matrix's diagonal, the smaller the score error.

[Figure: Score transition matrix]
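As a concrete illustration, below is a minimal sketch (assumptions: NumPy, integer scores on a 0-5 scale, and access to reference scores; this is not the repository's code) of how such a transition matrix can be estimated from paired reference and raw LLM scores:

# Minimal sketch: estimate a score transition matrix T, where T[i, j] is the
# probability that a sample with reference score i receives raw LLM score j.
# Assumes integer scores in {0, ..., 5}; not the repository's implementation.
import numpy as np

def score_transition_matrix(ref_scores, raw_scores, num_levels=6):
    T = np.zeros((num_levels, num_levels))
    for r, s in zip(ref_scores, raw_scores):
        T[r, s] += 1
    # Row-normalize so each row is a conditional distribution.
    return T / np.clip(T.sum(axis=1, keepdims=True), 1, None)

# A strongly diagonal matrix indicates mostly reliable raw scores.
T = score_transition_matrix([5, 4, 4, 3, 5, 2], [5, 4, 3, 3, 5, 2])
print(T.round(2))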

In response, we introduce DS2, a diversity-aware score curation approach to enhance data selection.

Overview of the Data Selection Pipeline

  • Prompt-based LLM Rating: We generate an initial quality score for each data sample using advanced LLMs.
  • Curated Quality Score Generation: This step corrects potential rating score errors from the previous step by leveraging the Score Transition Matrix to derive a curated quality score.
  • Long-tail Diversity Score Generation: We score the diversity of each example by measuring the distance between feature embeddings, identifying samples that fall outside common clusters and therefore tend to be more distinct (see the sketch after this list).
  • Final Data Selection: We prioritize data by sorting first on the curated scores and then on the long-tail scores. This dual sorting strategy helps remove poor-quality outliers while ensuring a diverse, high-quality dataset.
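As a concrete sketch of the long-tail diversity score, the following snippet scores each sample by its distance to the nearest embedding cluster centroid (assumptions: NumPy and scikit-learn KMeans; the repository's actual featurizer and clustering may differ):

# Hedged sketch: long-tail diversity as distance to the nearest K-means
# centroid in embedding space. Samples far from every common cluster get
# higher scores. Not the repository's exact implementation.
import numpy as np
from sklearn.cluster import KMeans

def long_tail_scores(embeddings, n_clusters=100):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    # Distance from each sample to its assigned cluster centroid;
    # larger distance = more long-tail, hence more distinct.
    return np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1)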

Dataset Preparation

One can download the evaluation and training data by running:

# eval data
bash model_finetune/prepare_eval_data.sh

# train data
bash model_finetune/prepare_train_data.sh

🚀🚀 Quick Start

🧩 Step 1. LLM-prompt-based rating

In this project, we use three labeling models to generate rating scores: GPT-4o-mini, Mistral-7B-Instruct-v0.3, and LLaMA-3.1-8B-Instruct. One can obtain the LLM-generated rating scores by running:

# Open-source LLMs
cd LLM_scoring && bash scoring.sh

# API call
cd LLM_scoring && bash scoring_api.sh
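For illustration, a single API-based rating call might look like the sketch below (assumptions: the official OpenAI Python client and a hypothetical prompt; the repository's actual prompt template and answer parsing live in LLM_scoring/ and may differ):

# Hedged sketch of prompt-based rating with the OpenAI API.
# The prompt wording here is a hypothetical stand-in.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rate_sample(instruction, response):
    prompt = (
        "Rate the quality of the following instruction-response pair on an "
        "integer scale from 0 (worst) to 5 (best). Reply with the number only.\n\n"
        f"Instruction: {instruction}\nResponse: {response}"
    )
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return int(out.choices[0].message.content.strip())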

🧩 Step 2. Score curation

One can execute the score curation by running:

cd score_curation && bash diagnose.sh

The corresponding curation report files can be found under score_curation_results/.
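Conceptually, curation uses the score transition matrix to recover a more reliable score from a noisy raw one. A hedged sketch of that idea (not the diagnose.sh implementation), assuming a transition matrix T as estimated earlier and a prior over true score levels:

# Hedged sketch: Bayesian correction of a raw score using transition matrix T
# (T[i, j] = P(raw score j | true score i)) and a prior over true levels.
import numpy as np

def curate_score(raw_score, T, prior):
    # Posterior over true levels, up to normalization: P(raw | true) * P(true).
    posterior = T[:, raw_score] * prior
    return int(np.argmax(posterior))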


🧩 Step 3. Data selection

Given the generated score curation reports, one can directly generate the high-quality subset by running:

python subset_generation.py

The generated subsets can then be used for LLM instruction tuning.
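For intuition, the dual sorting strategy from the overview can be sketched as below (the field names curated_score and long_tail_score are hypothetical; see subset_generation.py for the actual logic):

# Hedged sketch of final data selection: rank by curated quality score first,
# then by long-tail diversity score, and keep the top-k samples.
def select_subset(samples, k):
    ranked = sorted(
        samples,
        key=lambda s: (s["curated_score"], s["long_tail_score"]),
        reverse=True,
    )
    return ranked[:k]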


🧩 Step 4. Finetune & Evaluation

The generated subsets in the selected_data path can be used for LLM instruction tuning. For easy reproduction, one can directly finetune the models via (codebase: TULU):

cd model_finetune && bash run_pipeline.sh

Citation

If you use this repository, please cite our work:

@article{pang2024improving,
  title={Improving Data Efficiency via Curating LLM-Driven Rating Systems},
  author={Pang, Jinlong and Wei, Jiaheng and Shah, Ankit Parag and Zhu, Zhaowei and Wang, Yaxuan and Qian, Chen and Liu, Yang and Bao, Yujia and Wei, Wei},
  journal={International Conference on Learning Representations},
  year={2025}
}
