Fixes in LLMJudge #1498

Merged: 7 commits merged into main from fix_llmjudge on Jan 16, 2025

Conversation

lilacheden (Member):

  1. Allow changing main_score and score_prefix (and change the default main_score from "score", which doesn't support score_prefix); see the sketch below.
  2. Support embedded (nested) task data fields in the LLMJudge template.
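A minimal sketch of why item 1 matters, assuming the usual convention that a score_prefix is applied to named scores while the aggregate "score" key is left unprefixed (the helper below is illustrative, not the library's implementation):

```python
# Illustrative only: a prefix is applied to every score name except the
# aggregate "score" key, so a metric whose main_score is literally "score"
# can never carry a score_prefix; a dedicated name (e.g. the criteria) can.
def apply_prefix(score_names, score_prefix):
    return [
        name if name == "score" else f"{score_prefix}{name}"
        for name in score_names
    ]

print(apply_prefix(["score", "faithfulness"], "judge_"))
# ['score', 'judge_faithfulness']
```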

coveralls commented Jan 12, 2025:

Coverage Status

coverage: 79.384% (-0.04%) from 79.426% when pulling 9754a63 on fix_llmjudge into 0ed5ff6 on main.

Signed-off-by: lilacheden <[email protected]>
@@ -149,7 +150,7 @@ def get_contexts(self, task_data: List[Dict[str, Any]]) -> List[Dict[str, str]]:
         return [
             get_parsed_context(
                 {
-                    context_field: td[context_field]
+                    context_field.split("/")[-1]: dict_get(td, context_field)
Collaborator:

why the change?

Member Author:

To be able to add nested fields. For example, a user asked to use only the instruction from the original template rather than the full source; this way we can send metadata/template/instruction.
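A minimal standalone sketch of the idea, assuming only that dict_get resolves "/"-separated paths (the resolve_path helper below is illustrative, not the unitxt implementation):

```python
# A "/"-separated path is resolved into nested task_data, and the last path
# segment becomes the context key shown to the judge.
from typing import Any, Dict

def resolve_path(data: Dict[str, Any], path: str) -> Any:
    value: Any = data
    for key in path.split("/"):
        value = value[key]
    return value

task_data = {"metadata": {"template": {"instruction": "Answer briefly."}}}
field = "metadata/template/instruction"
print({field.split("/")[-1]: resolve_path(task_data, field)})
# {'instruction': 'Answer briefly.'}
```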

OfirArviv (Collaborator), Jan 12, 2025:

@elronbandel is this an acceptable way to do it, one that is fully supported and won't be prone to changes in the future?

Collaborator:

Anyway, you need to add documentation for this, @lilacheden.

lilacheden (Member Author), Jan 12, 2025:

@OfirArviv @elronbandel - if you want it even cleaner, we can support a dict of name: possibly-nested-field as the context_fields.

Member:

Yes. I think @martinscooper highlighted the need to be able to rename context fields.

Member Author:

ok, added it

Collaborator:

Does this allow for renaming context fields then? If so, do the keys correspond to the final context names used in the prompts, and the values to the task_data key names?

I think this is useful for adapting the task data to the criteria. For example, the squad dataset uses the term "context", but the coherence criteria description uses "original text".

Member Author:

@martinscooper - yes, this way you can send a dictionary where the key is the name of the field in the prompt and the value is the field name (or path) in the task data,
e.g. {"instructions": "metadata/template/instruction"} - the prompt will refer to the context as "instructions".

If a list is sent, the behavior is as before: each item serves as both the key and the value:
["question"] -> {"question": "question"}

martinscooper (Collaborator):

FYI: @elronbandel, @yoavkatz and I have been discussing and working on improving the reported scores for both direct and pairwise evaluators. The changes are included in this PR. You can look at this and this commits for the direct score changes, and this other one for the pairwise score changes.

In summary:

  • direct evaluators' main_score will be the criteria name (if available and the criteria is the same for all instances), e.g. faithfulness. If not, a generic main_score, llm_as_judge, is used. A more granular score, {main_score}_{model_name}_{provider}, is also included to avoid name conflicts when multiple metrics are used.
  • pairwise evaluators' main_score is the first system's winrate, called 1_winrate. Every other system's winrate and the mean winrate are reported too (see the illustration below).
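A hypothetical illustration of what the reported scores might look like under that scheme (the model/provider names, key names, and values below are made up for clarity):

```python
# Direct evaluator: main_score is the criteria name, plus a granular
# {main_score}_{model_name}_{provider} entry to avoid name clashes.
direct_scores = {
    "faithfulness": 0.82,
    "faithfulness_llama_3_70b_watsonx": 0.82,
}

# Pairwise evaluator: the first system's winrate is the main_score,
# with every other system's winrate and their mean reported as well.
pairwise_scores = {
    "1_winrate": 0.64,
    "2_winrate": 0.36,
    "mean_winrate": 0.50,
}
```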

yoavkatz (Member):

> (quoting @martinscooper's comment above on the direct and pairwise score changes)

Yes, I think we should not change the score names in this PR, and wait for #1467 for the changes.

lilacheden (Member Author):

> (quoting @martinscooper's comment and @yoavkatz's reply above)

@yoavkatz - reverted this main score change; can you approve the remaining change?

@@ -725,6 +730,9 @@ def get_instance_results(

         winrates = [r["winrate"] for r in per_response_results.values()]
         all_results["score"] = max(range(len(winrates)), key=winrates.__getitem__)
+        all_results[self.main_score] = max(
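For context, the surrounding line picks the index of the system with the highest winrate; a tiny standalone illustration of that pattern:

```python
# max over the indices, keyed by each index's winrate, returns the position
# of the best-performing system.
winrates = [0.25, 0.60, 0.15]
best_index = max(range(len(winrates)), key=winrates.__getitem__)
print(best_index)  # 1
```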
Collaborator:

Could you remove this change?

Member Author:

done

elronbandel enabled auto-merge (squash) January 16, 2025 09:02
elronbandel merged commit 5506f9c into main Jan 16, 2025
17 of 18 checks passed
elronbandel deleted the fix_llmjudge branch January 16, 2025 09:33