A general purpose subjective evaluation tool for LLMs using LLM-as-a-Judge. It has already been configured to support software design evaluation tasks for DevBench.
The evaluation is set to judge whether a response generated by a given model is better than that of a reference model under our predefined criteria. See evaluating_guidance for detailed metrics for different software design files.
More details on the terminology and instructions used in the SubEval context can be found here.
```bash
# Inside the subeval dir
pip install -e .
```
You should have a JSON file to store your keys (for example, you can name it `keys.json`), with the following content:
```json
{
    "openai-keys": [
        "",
        ""
    ]
}
```
Before running the evaluation scripts, set the environment variable `export KEYS=/path/to/your/keys.json`.
Also, since all the provided scripts are written to be run from the top-level directory, set `export PYTHONPATH=path/to/SubEval`, or `export PYTHONPATH=$PWD` if you are already in the top-level directory.
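For reference, the snippet below is a minimal sketch of how a script could locate and parse the key file via the `KEYS` environment variable. The field name `openai-keys` matches the example above, but the actual loader inside SubEval may differ.

```python
import json
import os

# Read the path to the key file from the KEYS environment variable
# exported above.
keys_path = os.environ["KEYS"]

with open(keys_path, "r") as f:
    keys = json.load(f)

# "openai-keys" holds a list of API keys, matching the example keys.json.
openai_keys = keys["openai-keys"]
print(f"Loaded {len(openai_keys)} OpenAI key(s) from {keys_path}")
```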
We provide example data from DevBench in `examples/DevBench_projects_example.xlsx` (here) for running the Subjective Evaluation Tool.
The data includes responses from the following models (4 GPT models and 6 open-source models):

- gpt-3.5-turbo-1106
- gpt-4-0613
- gpt-4-1106-preview
- gpt-4-0125-preview
- codellama-7b-instruct
- codellama-13b-instruct
- codellama-34b-instruct
- deepseek-coder-1.3b-instruct
- deepseek-coder-6.7b-instruct
- deepseek-coder-33b-instruct
The currently available GPT judges are listed here. Feel free to add your own models and judges.
To evaluate `gpt-4-0125-preview`'s responses using `gpt-3.5-turbo-1106` as the reference model and `gpt-4-1106-preview` as the judge, run the following command:

```bash
python3 subeval/subjective/sub_eval.py --data examples/DevBench_projects_example.xlsx --model gpt-4-0125-preview --refm gpt-3.5-turbo-1106 --judge gpt-4-1106-preview --eval-nopt 2 --eval-proc 1 --mode dual
```
or use this script:

```bash
chmod +x ./scripts/run_example.sh
./scripts/run_example.sh
```
To evaluate all models' responses using `gpt-3.5-turbo-1106` as both the reference model and the judge, run the following command:

```bash
python3 subeval/subjective/sub_eval.py --data examples/DevBench_projects_example.xlsx --model gpt-3.5-turbo-1106 gpt-4-0613 gpt-4-1106-preview gpt-4-0125-preview codellama-7b-instruct codellama-13b-instruct codellama-34b-instruct deepseek-coder-1.3b-instruct deepseek-coder-6.7b-instruct deepseek-coder-33b-instruct --refm gpt-3.5-turbo-1106 --judge gpt-3.5-turbo-1106 --eval-nopt 2 --eval-proc 1 --mode dual
```
or use this script:

```bash
chmod +x ./scripts/run_all_examples.sh
./scripts/run_all_examples.sh
```
Here is a brief overview of the necessary arguments; a short Python sketch of a full invocation follows the list. See `subeval/subjective/sub_eval.py` (here) for more detail.
- `--data`: The formatted Excel file. For the expected format, see here.
- `--model`: The model(s) whose responses will be evaluated.
- `--refm`: The reference/baseline model to compare against.
- `--judge`: The judge model that evaluates the responses.
- `--eval-nopt`: The number of options (explained here).
- `--eval-proc`: The number of processes for multi-process evaluation.
- `--mode`: The order in which responses are presented to the judge; choose between `dual` and `random`. We recommend the `dual` mode, which we also used for DevBench, for consistent evaluations.
- `--fill-contents`: Optional; load the actual contents from the paths in a given column (explained here).
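If you prefer to launch evaluations from Python rather than a shell script, the sketch below assembles the same arguments and invokes the CLI with `subprocess`. It uses only the flags documented above; the concrete values simply mirror the first example.

```python
import subprocess
import sys

# Reproduce the first example invocation from Python. All flags are the
# documented CLI arguments; the concrete values are illustrative.
cmd = [
    sys.executable, "subeval/subjective/sub_eval.py",
    "--data", "examples/DevBench_projects_example.xlsx",
    "--model", "gpt-4-0125-preview",
    "--refm", "gpt-3.5-turbo-1106",
    "--judge", "gpt-4-1106-preview",
    "--eval-nopt", "2",
    "--eval-proc", "1",
    "--mode", "dual",
]

# Run from the SubEval top-level directory so the relative paths and the
# PYTHONPATH setting above resolve correctly.
subprocess.run(cmd, check=True)
```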
SubEval was developed on top of another codebase of ours, which we adapted for the DevBench evaluation; as a result, this part contains some legacy code. We further processed the original outputs to obtain the results presented in the paper. We apologize for any inconvenience this may cause.
The results of the subjective evaluation are stored in the directory `output/{df_name}_infer_input_{seed}_record_{judge}_{nopt}`. Among the output files, `log.txt` aggregates all evaluation results, and `record_{judge}_{nopt}.tsv` contains the detailed evaluation output for each pair of responses.
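As a quick sanity check, you can open the per-pair record with pandas. The sketch below only assumes the file is tab-separated; the example path just fills the placeholders (`{df_name}`, `{seed}`, `{judge}`, `{nopt}`) with plausible values, so adjust it to match your own output directory.

```python
import pandas as pd

# Example path following the naming scheme above; replace the placeholder
# values with those produced by your own run.
record_path = (
    "output/DevBench_projects_example_infer_input_0_record_gpt-4-1106-preview_2/"
    "record_gpt-4-1106-preview_2.tsv"
)

df = pd.read_csv(record_path, sep="\t")
print(df.shape)             # judged response pairs x recorded fields
print(df.columns.tolist())  # inspect which fields the judge recorded
print(df.head())
```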
We include example evaluation results here (the `log.txt` files generated by running the two scripts above).
The old-version win rate is calculated only on consistent pairs of evaluations (see here for the definitions of consistent and inconsistent in our context). We also provide a new-version win-rate calculation, specified below, which treats an evaluation pair that becomes inconsistent when the response order in the prompt is swapped as a "tie" for both models. For a more detailed explanation and interpretation, see subeval.md.
In our DevBench paper, we used this new-version win-rate calculation.
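To make the "tie" treatment concrete, here is a small, hypothetical illustration of the idea rather than the actual logic of `calculate_winrate_new.py`. In particular, giving a tie half credit is just one common convention and an assumption here; consult the script and subeval.md for the exact formulas.

```python
# Hypothetical per-pair verdicts: each response pair is judged twice, with the
# response order swapped the second time. "model" = evaluated model preferred,
# "ref" = reference model preferred.
pairs = [
    ("model", "model"),  # consistent: counts as a win for the model
    ("ref", "ref"),      # consistent: counts as a loss
    ("model", "ref"),    # inconsistent: counted as a tie for both models
]

wins = losses = ties = 0
for first_order, swapped_order in pairs:
    if first_order == swapped_order:      # consistent pair
        if first_order == "model":
            wins += 1
        else:
            losses += 1
    else:                                 # inconsistent pair -> tie
        ties += 1

# Old-style rate: only consistent pairs enter the calculation.
win_rate_without_tie = wins / (wins + losses)
# New-style rate: ties enter the denominator; half credit is an assumption.
win_rate_with_tie = (wins + 0.5 * ties) / (wins + losses + ties)
print(win_rate_without_tie, win_rate_with_tie)
```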
To use the new-version calculation, run the following script from the top-level directory:

```bash
python ./subeval/subjective/calculate_winrate_new.py
```
If you specify a directory in which to save the calculation results, two CSV files will be saved there:

- `win_rate_with_tie.csv`
- `win_rate_without_tie.csv`
For the two examples we provide, `run_example.sh` produced all-consistent evaluations, so the old- and new-version win rates are identical and there is no need to recalculate with the new version. `run_all_examples.sh`, however, produced some inconsistency, so we calculated the new-version win rate both with and without ties considered.
You can decide whether to run the new-version win-rate calculation based on your needs and experiment results. We highly recommend running it for `eval-nopt 2` cases.
Now you are ready to go! Feel free to customize the Subjective Evaluation Tool to fit your needs!