Draft notes and calculation #4
base: main
Conversation
An alternative benchmark is presented in the paper [Evaluating Large Language Models at Evaluating Instruction Following](https://arxiv.org/pdf/2310.07641), which introduces the LLMBar benchmark. This benchmark focuses on objective instruction-following performance beyond correctness, using both naturally occurring and adversarially constructed human-annotated examples. The work also includes a discussion of prompting strategies and could be useful when considering how the prompt instructions being assessed are treated in terms of preference.
MT-Bench is a third benchmark, described in [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/pdf/2306.05685). It consists of 80 multi-turn questions used to evaluate chat assistants, with LLM judges assessed by their agreement with crowd-sourced human preferences from the Chatbot Arena platform. **DN: More on this**
Does this need adding to?
Overall, all of these benchmarks highlight limitations and are not recommended for assessing specific use-cases; they are better suited to supporting the initial development stages and providing additional considerations when choosing a model for the judge.
## 2. Metrics to assess against human evaluations
Alignment metrics are used to compare the evaluations from human judges against LLM judges. These look at the amount of overlap between the two sets of judgements, for example through percentage agreement, Cohen's kappa, or rank correlations such as Spearman's rho and Kendall's tau. **DN: Sentence of alignment metrics**
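To make this concrete, here is a minimal sketch of how two common alignment metrics could be computed for a small set of paired verdicts. The labels and helper code are illustrative only and are not taken from the notes.

```python
# Sketch (assumed, not from the notes): comparing LLM-judge labels against
# human labels with percentage agreement and Cohen's kappa.
from collections import Counter

human = ["A", "B", "A", "A", "B", "tie", "A", "B"]   # illustrative human verdicts
llm   = ["A", "B", "A", "B", "B", "A",   "A", "B"]   # illustrative LLM-judge verdicts

# Percentage agreement: fraction of items where the two judges give the same label.
n = len(human)
p_o = sum(h == m for h, m in zip(human, llm)) / n

# Cohen's kappa: agreement corrected for the agreement expected by chance.
h_counts, m_counts = Counter(human), Counter(llm)
p_e = sum((h_counts[c] / n) * (m_counts[c] / n) for c in set(human) | set(llm))
kappa = (p_o - p_e) / (1 - p_e)

print(f"Percentage agreement: {p_o:.2f}")
print(f"Cohen's kappa:        {kappa:.2f}")
```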
Does this also need adding to?
## 3. Identify strategies to improve the judge
[Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/pdf/2306.05685) & [JudgeLM: Fine-tuned large language models are scalable judges](https://arxiv.org/pdf/2310.17631) identify a range of potential biases that judges can exhibit. These include:
- Position bias: the tendency of a judge to favour a response based on where it appears in the prompt (for example, preferring whichever answer is presented first) rather than on its content (**DN: MORE**)
Positional and Knowledge bias need a quick sentence
These three mitigations are shown to improve the performance of the judge being evaluated.
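As a rough illustration of how the position-bias mitigation could be applied in practice, the sketch below judges each pair of answers in both orders and only accepts a verdict that is consistent across the swap. The `ask_judge` helper is hypothetical and stands in for whatever call is made to the judge model.

```python
# Sketch (assumed): mitigating position bias by judging each pair in both
# orders and only keeping verdicts that survive the swap.

def ask_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Hypothetical call to the judge model; returns 'first', 'second' or 'tie'."""
    raise NotImplementedError("replace with a call to your judge model")

def swap_consistent_verdict(question: str, answer_a: str, answer_b: str) -> str:
    # Judge the pair in the original order and again with the answers swapped.
    forward = ask_judge(question, answer_a, answer_b)
    backward = ask_judge(question, answer_b, answer_a)

    # Map the swapped verdict back onto the original ordering.
    flipped = {"first": "second", "second": "first", "tie": "tie"}[backward]

    # Only accept a verdict that is consistent across both orderings;
    # otherwise treat the comparison as a tie (or flag it for human review).
    return forward if forward == flipped else "tie"
```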
## Additional Considerations
[When combinations of humans and AI are useful: A systematic review and meta-analysis](https://www.nature.com/articles/s41562-024-02024-1) draws an interesting conclusion from a systematic review of 106 studies comparing evaluation performance by humans alone, AI alone, and human-AI combinations. On average, the human-AI combinations performed worse than the better of humans alone or AI alone. Where humans alone outperformed AI alone, combining the two produced performance gains; where AI alone outperformed humans, combining the two produced losses. The review also found a trend of negative human-AI synergy for classification tasks but a positive uplift in performance for creative tasks.
I found this a little confusing:
but when humans alone outperformed AI alone, they found performance gains in the combination but when AI outperformed humans alone, they found losses.