
New paper: Elsevier Arena: Human Evaluation of Chemistry/Biology/Health #38

Open
maykcaldas opened this issue Sep 12, 2024 · 0 comments

@maykcaldas (Collaborator)
Paper: Elsevier Arena: Human Evaluation of Chemistry/Biology/Health

Authors: Camilo Thorne, Christian Druckenbrodt, Kinga Szarkowska, Deepika

Abstract: The quality and capabilities of large language models cannot currently be fully assessed with automated, benchmark evaluations. Instead, human evaluations that expand on traditional qualitative techniques from the natural language generation literature are required. One recent best practice consists in using A/B-testing frameworks, which capture the preferences of human evaluators for specific models. In this paper we describe a human evaluation experiment focused on the biomedical domain (health, biology, chemistry/pharmacology) carried out at Elsevier. In it, a large but not massive (8.8B parameter) decoder-only foundational transformer, trained on a relatively small (135B tokens) but highly curated collection of Elsevier datasets, is compared against multiple criteria to OpenAI's GPT-3.5-turbo and Meta's foundational 7B parameter Llama 2 model. Results indicate, even if IRR scores were generally low, a preference towards GPT-3.5-turbo, and hence towards models that possess conversational abilities, are very large, and were trained on very large datasets. At the same time, they indicate that for less massive models, training on smaller but well-curated training sets can potentially give rise to viable alternatives in the biomedical domain.
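In concrete terms, the A/B-testing setup described in the abstract reduces to counting which model each rater prefers per prompt, and the reported IRR can be measured with a statistic such as Cohen's kappa. Here is a minimal, self-contained sketch of both computations; the rater votes and model names below are hypothetical, not data from the paper:

```python
from collections import Counter

def preference_share(votes):
    """Fraction of A/B votes won by each model (ties excluded)."""
    counts = Counter(v for v in votes if v != "tie")
    total = sum(counts.values())
    return {model: n / total for model, n in counts.items()}

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled independently
    # according to their own marginal label frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum(ca[label] * cb[label] for label in ca.keys() | cb.keys()) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical votes from two raters on the same five prompts.
r1 = ["gpt-3.5-turbo", "elsevier-8.8b", "gpt-3.5-turbo", "tie", "gpt-3.5-turbo"]
r2 = ["gpt-3.5-turbo", "gpt-3.5-turbo", "gpt-3.5-turbo", "tie", "elsevier-8.8b"]

print(preference_share(r1 + r2))  # pooled preference shares
print(cohens_kappa(r1, r2))       # low kappa signals weak inter-rater agreement
```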

Link: https://arxiv.org/abs/2409.05486

Reasoning: Let's think step by step in order to produce the answer. We need to determine if the paper is about a language model. The abstract mentions the evaluation of large language models, specifically comparing a custom model to GPT-3.5-turbo and Llama 2. It discusses the performance and preferences of these models in the biomedical domain. Since the focus is on evaluating and comparing language models, it is clear that the paper is about language models.
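The "Let's think step by step in order to produce the answer" phrasing matches a chain-of-thought rationale template, suggesting the label above came from an automated relevance filter. A minimal sketch of such a filter using the OpenAI Python client follows; the actual pipeline, model choice, and prompt used by this repository are not shown in the issue, so everything here is an assumption:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt; the repository's real template may differ.
PROMPT = (
    "Decide whether the paper below is about language models.\n"
    "Think step by step, then end with a final line reading "
    "'Answer: yes' or 'Answer: no'.\n\n"
    "Title: {title}\n\nAbstract: {abstract}"
)

def is_about_language_models(title: str, abstract: str) -> bool:
    """Chain-of-thought relevance classification via an LLM (sketch)."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; any chat model works
        messages=[
            {"role": "user", "content": PROMPT.format(title=title, abstract=abstract)}
        ],
        temperature=0,
    )
    text = resp.choices[0].message.content
    # Keep only the final verdict; the intermediate reasoning is discarded.
    return text.strip().lower().endswith("yes")
```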
