Fixes in LLMJudge #1498

Merged: 7 commits merged into main from fix_llmjudge on Jan 16, 2025

Conversation

lilacheden (Member):

  1. Allow changing main_score and score_prefix (and change the default main_score from "score", which doesn't support score_prefix); see the sketch below.
  2. Support embedded (nested) task data fields in the LLMJudge template.
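A minimal sketch of why item 1 matters, assuming the usual convention that a score_prefix is applied to named scores while the aggregate "score" key is left unprefixed (the helper below is illustrative, not the library's implementation):

```python
# Illustrative only: a prefix is applied to every score name except the
# aggregate "score" key, so a metric whose main_score is literally "score"
# can never carry a score_prefix; a dedicated name (e.g. the criteria) can.
def apply_prefix(score_names, score_prefix):
    return [
        name if name == "score" else f"{score_prefix}{name}"
        for name in score_names
    ]

print(apply_prefix(["score", "faithfulness"], "judge_"))
# ['score', 'judge_faithfulness']
```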

coveralls commented Jan 12, 2025:

Coverage Status

coverage: 79.384% (-0.04%) from 79.426% when pulling 9754a63 on fix_llmjudge into 0ed5ff6 on main.

Signed-off-by: lilacheden <[email protected]>
@@ -149,7 +150,7 @@ def get_contexts(self, task_data: List[Dict[str, Any]]) -> List[Dict[str, str]]:
         return [
             get_parsed_context(
                 {
-                    context_field: td[context_field]
+                    context_field.split("/")[-1]: dict_get(td, context_field)
Collaborator:

why the change?

Member Author:

To be able to add nested fields. For example, a user asked to use only the instruction from the original template rather than the full source; this way we can send metadata/template/instruction.
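A minimal standalone sketch of the idea, assuming only that dict_get resolves "/"-separated paths (the resolve_path helper below is illustrative, not the unitxt implementation):

```python
# A "/"-separated path is resolved into nested task_data, and the last path
# segment becomes the context key shown to the judge.
from typing import Any, Dict

def resolve_path(data: Dict[str, Any], path: str) -> Any:
    value: Any = data
    for key in path.split("/"):
        value = value[key]
    return value

task_data = {"metadata": {"template": {"instruction": "Answer briefly."}}}
field = "metadata/template/instruction"
print({field.split("/")[-1]: resolve_path(task_data, field)})
# {'instruction': 'Answer briefly.'}
```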

OfirArviv (Collaborator), Jan 12, 2025:

@elronbandel is this an acceptable way to do it, one that is fully supported and won't be prone to changes in the future?

Collaborator:

Anyway, you need to add documentation for this, @lilacheden.

lilacheden (Member Author), Jan 12, 2025:

@OfirArviv @elronbandel - if you want it even cleaner, we can support a dict of name: possibly-nested-field as the context_fields.

Member:

Yes. I think @martinscooper highlighted the need to be able to rename context fields.

Member Author:

ok, added it

Collaborator:

Does this allow for renaming context fields then? If so, do the keys correspond to the final context names used in the prompts, and the values to the task_data key names?

I think this is useful for adapting the task data to the criteria. For example, the squad dataset uses the term "context", but the coherence criteria description uses "original text".

Member Author:

@martinscooper - yes, this way you can send a dictionary where the key is the name of the field in the prompt and the value is the field name (or path) in the task data,
e.g. {"instructions": "metadata/template/instruction"} - the prompt will refer to the context as "instructions".

If a list is sent, the behavior is as before: each item serves as both the key and the value:
["question"] -> {"question": "question"}

martinscooper (Collaborator):

FYI: @elronbandel, @yoavkatz and I have been discussing and working on improving the reported scores for both direct and pairwise evaluators. The changes are included in this PR. You can look at this and this commits for the direct score changes, and this other one for the pairwise score changes.

In summary:

  • direct evaluators' main_score will be the criteria name (if available and the criteria is the same for all instances), e.g. faithfulness. If not, a generic main_score, llm_as_judge, is used. A more granular score, {main_score}_{model_name}_{provider}, is also included to avoid name conflicts when multiple metrics are used.
  • pairwise evaluators' main_score is the first system's winrate, called 1_winrate. Every other system's winrate and the mean winrate are reported too (see the illustration below).
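A hypothetical illustration of what the reported scores might look like under that scheme (the model/provider names, key names, and values below are made up for clarity):

```python
# Direct evaluator: main_score is the criteria name, plus a granular
# {main_score}_{model_name}_{provider} entry to avoid name clashes.
direct_scores = {
    "faithfulness": 0.82,
    "faithfulness_llama_3_70b_watsonx": 0.82,
}

# Pairwise evaluator: the first system's winrate is the main_score,
# with every other system's winrate and their mean reported as well.
pairwise_scores = {
    "1_winrate": 0.64,
    "2_winrate": 0.36,
    "mean_winrate": 0.50,
}
```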

yoavkatz (Member):

> (quoting @martinscooper's comment above on the direct and pairwise score changes)

Yes, I think we should not change the score names in this PR, and wait for #1467 for the changes.

lilacheden (Member Author):

> (quoting @martinscooper's comment and @yoavkatz's reply above)

@yoavkatz - reverted this main score change; can you approve the remaining change?

@@ -725,6 +730,9 @@ def get_instance_results(

         winrates = [r["winrate"] for r in per_response_results.values()]
         all_results["score"] = max(range(len(winrates)), key=winrates.__getitem__)
+        all_results[self.main_score] = max(
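For context, the surrounding line picks the index of the system with the highest winrate; a tiny standalone illustration of that pattern:

```python
# max over the indices, keyed by each index's winrate, returns the position
# of the best-performing system.
winrates = [0.25, 0.60, 0.15]
best_index = max(range(len(winrates)), key=winrates.__getitem__)
print(best_index)  # 1
```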
Collaborator:

Could you remove this change?

Member Author:

done

elronbandel enabled auto-merge (squash) January 16, 2025 09:02
elronbandel merged commit 5506f9c into main Jan 16, 2025
17 of 18 checks passed
elronbandel deleted the fix_llmjudge branch January 16, 2025 09:33