Draft notes and calculation #4
base: main
Conversation
An alternative benchmark is presented in the paper [Evaluating Large Language Models at Evaluating Instruction Following](https://arxiv.org/pdf/2310.07641), which introduces the LLMBar benchmark. This benchmark focuses on objective instruction-following performance beyond correctness, using both naturally occurring and adversarially constructed human-annotated examples. The work also includes a discussion of prompting strategies and could be useful when considering how the prompt instructions being assessed are treated in terms of preference.
MT-Bench is a third benchmark, described in [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/pdf/2306.05685). It consists of 80 multi-turn questions used to evaluate chat assistants, with LLM judges assessed by their agreement with crowd-sourced human preferences from the Chatbot Arena platform. **DN: More on this**
Does this need adding to?
Overall, all of these benchmarks highlight limitations and are not recommended for assessing specific use-cases; they are better suited to supporting the initial development stages and providing additional considerations when choosing a model for the judge.
## 2. Metrics to assess against human evaluations
Alignment metrics are used to compare the evaluations from human judges against LLM judges. These look at the amount of overlap between the two sets of judgements, for example through percentage agreement, Cohen's kappa, or rank correlations such as Spearman's rho and Kendall's tau. **DN: Sentence of alignment metrics**
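To make this concrete, here is a minimal sketch of how two common alignment metrics could be computed for a small set of paired verdicts. The labels and helper code are illustrative only and are not taken from the notes.

```python
# Sketch (assumed, not from the notes): comparing LLM-judge labels against
# human labels with percentage agreement and Cohen's kappa.
from collections import Counter

human = ["A", "B", "A", "A", "B", "tie", "A", "B"]   # illustrative human verdicts
llm   = ["A", "B", "A", "B", "B", "A",   "A", "B"]   # illustrative LLM-judge verdicts

# Percentage agreement: fraction of items where the two judges give the same label.
n = len(human)
p_o = sum(h == m for h, m in zip(human, llm)) / n

# Cohen's kappa: agreement corrected for the agreement expected by chance.
h_counts, m_counts = Counter(human), Counter(llm)
p_e = sum((h_counts[c] / n) * (m_counts[c] / n) for c in set(human) | set(llm))
kappa = (p_o - p_e) / (1 - p_e)

print(f"Percentage agreement: {p_o:.2f}")
print(f"Cohen's kappa:        {kappa:.2f}")
```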
Does this also need adding to?
## 3. Identify strategies to improve the judge
[Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/pdf/2306.05685) & [JudgeLM: Fine-tuned large language models are scalable judges](https://arxiv.org/pdf/2310.17631) identify a range of potential biases that judges can exhibit. These include:
- Position bias: the tendency of a judge to favour a response based on where it appears in the prompt (for example, preferring whichever answer is presented first) rather than on its content (**DN: MORE**)
Positional and Knowledge bias need a quick sentence
These three mitigations are shown to improve the performance of the judge being evaluated.
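As a rough illustration of how the position-bias mitigation could be applied in practice, the sketch below judges each pair of answers in both orders and only accepts a verdict that is consistent across the swap. The `ask_judge` helper is hypothetical and stands in for whatever call is made to the judge model.

```python
# Sketch (assumed): mitigating position bias by judging each pair in both
# orders and only keeping verdicts that survive the swap.

def ask_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Hypothetical call to the judge model; returns 'first', 'second' or 'tie'."""
    raise NotImplementedError("replace with a call to your judge model")

def swap_consistent_verdict(question: str, answer_a: str, answer_b: str) -> str:
    # Judge the pair in the original order and again with the answers swapped.
    forward = ask_judge(question, answer_a, answer_b)
    backward = ask_judge(question, answer_b, answer_a)

    # Map the swapped verdict back onto the original ordering.
    flipped = {"first": "second", "second": "first", "tie": "tie"}[backward]

    # Only accept a verdict that is consistent across both orderings;
    # otherwise treat the comparison as a tie (or flag it for human review).
    return forward if forward == flipped else "tie"
```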
## Additional Considerations
[When combinations of humans and AI are useful: A systematic review and meta-analysis](https://www.nature.com/articles/s41562-024-02024-1) draws an interesting conclusion from a systematic review of 106 studies comparing evaluation performance by humans alone, AI alone, and human-AI combinations. On average, the human-AI combinations performed worse than the better of humans alone or AI alone. Where humans alone outperformed AI alone, combining the two produced performance gains; where AI alone outperformed humans, combining the two produced losses. The review also found a trend of negative human-AI synergy for classification tasks but a positive uplift in performance for creative tasks.
I found this a little confusing:
but when humans alone outperformed AI alone, they found performance gains in the combination but when AI outperformed humans alone, they found losses.