Paper: Elsevier Arena: Human Evaluation of Chemistry/Biology/Health Foundational Large Language Models
Authors: Camilo Thorne, Christian Druckenbrodt, Kinga Szarkowska, Deepika
Abstract: The quality and capabilities of large language models cannot currently be fully assessed with automated benchmark evaluations. Instead, human evaluations that expand on traditional qualitative techniques from the natural language generation literature are required. One recent best practice consists in using A/B-testing frameworks, which capture the preferences of human evaluators for specific models. In this paper we describe a human evaluation experiment focused on the biomedical domain (health, biology, chemistry/pharmacology) carried out at Elsevier. In it, a large but not massive (8.8B parameter) decoder-only foundational transformer, trained on a relatively small (135B tokens) but highly curated collection of Elsevier datasets, is compared against multiple criteria to OpenAI's GPT-3.5-turbo and Meta's foundational 7B parameter Llama 2 model. Results indicate, even if IRR (inter-rater reliability) scores were generally low, a preference towards GPT-3.5-turbo, and hence towards models that possess conversational abilities, are very large, and were trained on very large datasets. At the same time, they indicate that for less massive models, training on smaller but well-curated training sets can potentially give rise to viable alternatives in the biomedical domain.
Link: https://arxiv.org/abs/2409.05486
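A side note on the abstract's mention of low IRR: IRR stands for inter-rater reliability, i.e. how consistently human evaluators agree on the same A/B preference judgments. A minimal sketch of one common IRR measure, Cohen's kappa, over hypothetical preference labels (not the paper's data, and not necessarily the exact metric the authors used):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    Each label is a categorical A/B preference, e.g. "A", "B", or "tie".
    Returns 1.0 for perfect agreement, ~0.0 for chance-level agreement.
    """
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)  # undefined if p_e == 1 (degenerate labels)

# Hypothetical A/B preferences from two annotators over five prompts.
rater_1 = ["A", "A", "B", "tie", "A"]
rater_2 = ["A", "B", "B", "A", "A"]
print(f"kappa = {cohens_kappa(rater_1, rater_2):.2f}")
```

A "generally low" kappa, as the abstract reports, means the preference signal is noisy: individual raters often disagreed even when the aggregate favored one model.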
Reasoning: Let's think step by step in order to produce the answer. We need to determine whether the paper is about a language model. The abstract describes the evaluation of large language models, specifically comparing a custom model to GPT-3.5-turbo and Llama 2, and discusses their performance and evaluator preferences in the biomedical domain. Since the focus is on evaluating and comparing language models, the paper is clearly about language models.
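The reasoning above is the output of an automated relevance check. Purely as an illustration (this is not the actual pipeline used here), a minimal sketch of how such a check might be scripted against an OpenAI-style chat API; the model name, prompt wording, and answer parsing are all assumptions:

```python
# Hypothetical "is this paper about a language model?" check.
# The model name, prompt, and YES/NO parsing below are illustrative
# assumptions, not the pipeline that produced the reasoning above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Let's think step by step in order to decide whether the paper below "
    "is about a language model. End your answer with YES or NO.\n\n"
    "Title: {title}\n\nAbstract: {abstract}"
)

def is_about_language_model(title: str, abstract: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any chat model would do
        messages=[{"role": "user",
                   "content": PROMPT.format(title=title, abstract=abstract)}],
    )
    reasoning = response.choices[0].message.content
    return reasoning.strip().upper().endswith("YES")
```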