Fixing up test cases after API changes to add error_rate
Also added scripts/infra to the cloud-instance usage in the README to match the change to the CLI, and moved the first-N option up, since it now applies to answer generation as well.

Signed-off-by: Dan McPherson <[email protected]>
danmcp committed Jul 12, 2024
1 parent 6041f91 commit baa89ca
Showing 3 changed files with 11 additions and 9 deletions.
14 changes: 7 additions & 7 deletions README.md
@@ -10,12 +10,12 @@ Python Library for Evaluation
## MT-Bench / MT-Bench-Branch Testing Steps

```shell
-# Optional: Use cloud-instance.sh to launch and setup the instance
-./cloud-instance.sh ec2 launch -t g5.4xlarge
-./cloud-instance.sh ec2 setup-rh-devenv
-./cloud-instance.sh ec2 install-rh-nvidia-drivers
-./cloud-instance.sh ec2 ssh sudo reboot
-./cloud-instance.sh ec2 ssh
+# Optional: Use [cloud-instance.sh](https://github.com/instructlab/instructlab/tree/main/scripts/infra) to launch and setup the instance
+scripts/infra/cloud-instance.sh ec2 launch -t g5.4xlarge
+scripts/infra/cloud-instance.sh ec2 setup-rh-devenv
+scripts/infra/cloud-instance.sh ec2 install-rh-nvidia-drivers
+scripts/infra/cloud-instance.sh ec2 ssh sudo reboot
+scripts/infra/cloud-instance.sh ec2 ssh


# Regardless of how you setup your instance
@@ -33,6 +33,7 @@ python -m vllm.entrypoints.openai.api_server --model instructlab/granite-7b-lab
In another shell window

```shell
+export INSTRUCTLAB_EVAL_FIRST_N_QUESTIONS=10 # Optional if you want to shorten run times
python3 tests/test_gen_answers.py
python3 tests/test_branch_gen_answers.py
```
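As an aside on what the variable moved up in this hunk does: below is a minimal sketch of first-N truncation, assuming the library simply reads the value from the environment. The parsing and the placeholder question list are illustrative, not the library's actual code.

```python
# Illustrative sketch only: how a first-N cap like
# INSTRUCTLAB_EVAL_FIRST_N_QUESTIONS can shorten a run.
import os

# Placeholder question set; the real tests load MT-Bench questions.
questions = ["q1", "q2", "q3", "q4"]

first_n = os.environ.get("INSTRUCTLAB_EVAL_FIRST_N_QUESTIONS")
if first_n is not None:
    questions = questions[: int(first_n)]  # keep only the first N questions
print(f"Evaluating {len(questions)} question(s)")
```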
@@ -65,7 +66,6 @@ eval_output/
```

```shell
-export INSTRUCTLAB_EVAL_FIRST_N_QUESTIONS=40 # Optional if you want to shorten run times
python3 tests/test_judge_answers.py
python3 tests/test_branch_judge_answers.py
```
3 changes: 2 additions & 1 deletion tests/test_branch_judge_answers.py
@@ -10,7 +10,8 @@
"../taxonomy",
"main",
)
-qa_pairs = mt_bench_branch.judge_answers("http://localhost:8000/v1")
+qa_pairs, error_rate = mt_bench_branch.judge_answers("http://localhost:8000/v1")
+print(f"Error Rate: {error_rate}")
print(f"QA Pair 0:")
pprint.pprint(qa_pairs[0])

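Pieced together from the hunk above, the updated branch test reads roughly as follows. Only the lines inside the hunk are confirmed by the diff; the import and the evaluator name and first two constructor arguments are assumptions made by analogy with tests/test_judge_answers.py below.

```python
# Hypothetical reconstruction of tests/test_branch_judge_answers.py after this
# commit; only the hunk lines are confirmed by the diff, the rest is assumed.
import pprint

from instructlab.eval.mt_bench import MTBenchBranchEvaluator  # assumed name

mt_bench_branch = MTBenchBranchEvaluator(
    "instructlab/granite-7b-lab",  # assumed, by analogy with the MT-Bench test
    "instructlab/granite-7b-lab",  # assumed
    "../taxonomy",
    "main",
)
qa_pairs, error_rate = mt_bench_branch.judge_answers("http://localhost:8000/v1")
print(f"Error Rate: {error_rate}")
print("QA Pair 0:")
pprint.pprint(qa_pairs[0])
```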
3 changes: 2 additions & 1 deletion tests/test_judge_answers.py
@@ -5,13 +5,14 @@
from instructlab.eval.mt_bench import MTBenchEvaluator

mt_bench = MTBenchEvaluator("instructlab/granite-7b-lab", "instructlab/granite-7b-lab")
-overall_score, qa_pairs, turn_scores = mt_bench.judge_answers(
+overall_score, qa_pairs, turn_scores, error_rate = mt_bench.judge_answers(
"http://localhost:8000/v1"
)

print(f"Overall Score: {overall_score}")
print(f"Turn 1 Score: {turn_scores[0]}")
print(f"Turn 2 Score: {turn_scores[1]}")
print(f"Error Rate: {error_rate}")
print(f"QA Pair 0:")
pprint.pprint(qa_pairs[0])

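For reference, here is the whole updated test once the hunk is applied; the `import pprint` line above the hunk is inferred from the `pprint.pprint` call, and everything else appears verbatim in the diff.

```python
# tests/test_judge_answers.py after this commit (reconstructed; the
# `import pprint` line above the hunk is inferred, the rest is from the diff).
import pprint

from instructlab.eval.mt_bench import MTBenchEvaluator

mt_bench = MTBenchEvaluator("instructlab/granite-7b-lab", "instructlab/granite-7b-lab")
overall_score, qa_pairs, turn_scores, error_rate = mt_bench.judge_answers(
    "http://localhost:8000/v1"
)

print(f"Overall Score: {overall_score}")
print(f"Turn 1 Score: {turn_scores[0]}")
print(f"Turn 2 Score: {turn_scores[1]}")
print(f"Error Rate: {error_rate}")
print("QA Pair 0:")
pprint.pprint(qa_pairs[0])
```

The diff does not define error_rate's semantics; presumably it reports the fraction of judgments that failed, so callers may want to treat a high value as a sign the scores are unreliable.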
