
DS2: Improving Data Efficiency via Curating LLM-Driven Rating Systems

Jinlong Pang, Jiaheng Wei, Ankit Parag Shah, Zhaowei Zhu, Yaxuan Wang, Chen Qian, Yang Liu, Yujia Bao and Wei Wei.

REAL Lab, University of California, Santa Cruz


🎉🎉 News

  • [2025.01.22] 👏👏 Accepted by ICLR 2025.
  • [2024.11.10] 📢📢 Released the curated dataset on Hugging Face.
  • [2024.10.08] 🚀🚀 Released the code of DS2.

Brief Introduction

This project is motivated by the frequent and widespread errors in LLM-generated raw rating scores, which can vary significantly across different labeling models. These score errors can be visualized with a score transition matrix: the larger the values on the matrix's diagonal, the smaller the score error.

[Figure: Score transition matrix]
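As a concrete illustration, below is a minimal sketch (assumptions: NumPy, integer scores on a 0-5 scale, and access to reference scores; this is not the repository's code) of how such a transition matrix can be estimated from paired reference and raw LLM scores:

# Minimal sketch: estimate a score transition matrix T, where T[i, j] is the
# probability that a sample with reference score i receives raw LLM score j.
# Assumes integer scores in {0, ..., 5}; not the repository's implementation.
import numpy as np

def score_transition_matrix(ref_scores, raw_scores, num_levels=6):
    T = np.zeros((num_levels, num_levels))
    for r, s in zip(ref_scores, raw_scores):
        T[r, s] += 1
    # Row-normalize so each row is a conditional distribution.
    return T / np.clip(T.sum(axis=1, keepdims=True), 1, None)

# A strongly diagonal matrix indicates mostly reliable raw scores.
T = score_transition_matrix([5, 4, 4, 3, 5, 2], [5, 4, 3, 3, 5, 2])
print(T.round(2))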

In response, we introduce DS2, a diversity-aware score curation approach to enhance data selection.

Overview of the Data Selection Pipeline

  • Prompt-based LLM Rating: We generate an initial quality score for each data sample using advanced LLMs.
  • Curated Quality Score Generation: This step corrects potential rating score errors from the previous step by leveraging the Score Transition Matrix to derive a curated quality score.
  • Long-tail Diversity Score Generation: We score the diversity of each example by measuring the distance between feature embeddings, identifying samples that fall outside common clusters and therefore tend to be more distinct (see the sketch after this list).
  • Final Data Selection: We prioritize data by sorting first on the curated scores and then on the long-tail scores. This dual sorting strategy helps remove poor-quality outliers while ensuring a diverse, high-quality dataset.
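As a concrete sketch of the long-tail diversity score, the following snippet scores each sample by its distance to the nearest embedding cluster centroid (assumptions: NumPy and scikit-learn KMeans; the repository's actual featurizer and clustering may differ):

# Hedged sketch: long-tail diversity as distance to the nearest K-means
# centroid in embedding space. Samples far from every common cluster get
# higher scores. Not the repository's exact implementation.
import numpy as np
from sklearn.cluster import KMeans

def long_tail_scores(embeddings, n_clusters=100):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    # Distance from each sample to its assigned cluster centroid;
    # larger distance = more long-tail, hence more distinct.
    return np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1)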

Dataset Preparation

One can download the evaluation and training data by running:

# eval data
bash model_finetune/prepare_eval_data.sh

# train data
bash model_finetune/prepare_train_data.sh

🚀🚀 Quick Start

🧩 Step 1. LLM-prompt-based rating

In this project, we use three labeling models to generate rating scores: GPT-4o-mini, Mistral-7B-Instruct-v0.3, and LLaMA-3.1-8B-Instruct. One can obtain the LLM-generated rating scores by running:

# Open-source LLMs
cd LLM_scoring && bash scoring.sh

# API call
cd LLM_scoring && bash scoring_api.sh
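For illustration, a single API-based rating call might look like the sketch below (assumptions: the official OpenAI Python client and a hypothetical prompt; the repository's actual prompt template and answer parsing live in LLM_scoring/ and may differ):

# Hedged sketch of prompt-based rating with the OpenAI API.
# The prompt wording here is a hypothetical stand-in.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rate_sample(instruction, response):
    prompt = (
        "Rate the quality of the following instruction-response pair on an "
        "integer scale from 0 (worst) to 5 (best). Reply with the number only.\n\n"
        f"Instruction: {instruction}\nResponse: {response}"
    )
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return int(out.choices[0].message.content.strip())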

🧩 Step 2. Score curation

One can execute the score curation by running:

cd score_curation && bash diagnose.sh

The corresponding curation report files can be found under score_curation_results/.
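Conceptually, curation uses the score transition matrix to recover a more reliable score from a noisy raw one. A hedged sketch of that idea (not the diagnose.sh implementation), assuming a transition matrix T as estimated earlier and a prior over true score levels:

# Hedged sketch: Bayesian correction of a raw score using transition matrix T
# (T[i, j] = P(raw score j | true score i)) and a prior over true levels.
import numpy as np

def curate_score(raw_score, T, prior):
    # Posterior over true levels, up to normalization: P(raw | true) * P(true).
    posterior = T[:, raw_score] * prior
    return int(np.argmax(posterior))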


🧩 Step 3. Data selection

Given the generated score curation reports, one can directly generate the high-quality subset by running:

python subset_generation.py

The generated subsets can then be used for LLM instruction tuning.
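For intuition, the dual sorting strategy from the overview can be sketched as below (the field names curated_score and long_tail_score are hypothetical; see subset_generation.py for the actual logic):

# Hedged sketch of final data selection: rank by curated quality score first,
# then by long-tail diversity score, and keep the top-k samples.
def select_subset(samples, k):
    ranked = sorted(
        samples,
        key=lambda s: (s["curated_score"], s["long_tail_score"]),
        reverse=True,
    )
    return ranked[:k]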


🧩 Step 4. Finetune & Evaluation

The generated subsets in the selected_data path can be used for LLM instruction tuning. For easy reproduction, one can directly finetune the models via (codebase: TULU):

cd model_finetune && bash run_pipeline.sh

Citation

If you use this repository, please cite our work:

@article{pang2024improving,
  title={Improving Data Efficiency via Curating LLM-Driven Rating Systems},
  author={Pang, Jinlong and Wei, Jiaheng and Shah, Ankit Parag and Zhu, Zhaowei and Wang, Yaxuan and Qian, Chen and Liu, Yang and Bao, Yujia and Wei, Wei},
  journal={International Conference on Learning Representations},
  year={2025}
}
