Benchmarking Human-AI Collaboration for Common Evidence Appraisal Tools

Tim Woelfle, Julian Hirt, Perrine Janiaud, Ludwig Kappos, John Ioannidis, Lars G Hemkens

Preprint on medRxiv

Abstract

Background: It is unknown whether large language models (LLMs) may facilitate time- and resource-intensive text-related processes in evidence appraisal. Objectives: To quantify the agreement of LLMs with human consensus in appraisal of scientific reporting (PRISMA) and methodological rigor (AMSTAR) of systematic reviews and design of clinical trials (PRECIS-2). To identify areas where human-AI collaboration would outperform the traditional consensus process of human raters in efficiency.

Design: Five LLMs (Claude-3-Opus, Claude-2, GPT-4, GPT-3.5, Mixtral-8x22B) assessed 112 systematic reviews applying the PRISMA and AMSTAR criteria, and 56 randomized controlled trials applying PRECIS-2. We quantified agreement between the human consensus and (1) individual human raters; (2) individual LLMs; (3) a combined-LLMs approach; and (4) human-AI collaboration. Ratings were marked as deferred (undecided) in cases of inconsistency between the combined LLMs or between the human rater and the LLM.

Results: Individual human rater accuracy was 89% for PRISMA and AMSTAR, and 75% for PRECIS-2. Individual LLM accuracy ranged from 63% (GPT-3.5) to 70% (Claude-3-Opus) for PRISMA, 53% (GPT-3.5) to 74% (Claude-3-Opus) for AMSTAR, and 38% (GPT-4) to 55% (GPT-3.5) for PRECIS-2. Combined LLM ratings led to accuracies of 75-88% for PRISMA (4-74% deferred), 74-89% for AMSTAR (6-84% deferred), and 64-79% for PRECIS-2 (29-88% deferred). Human-AI collaboration yielded the best accuracies: 89-96% for PRISMA (25/35% deferred), 91-95% for AMSTAR (27/30% deferred), and 80-86% for PRECIS-2 (76/71% deferred).

Conclusions: Current LLMs alone appraised evidence worse than humans. Human-AI collaboration may reduce the workload for the second human rater in the assessment of reporting (PRISMA) and methodological rigor (AMSTAR), but not for complex tasks such as PRECIS-2.
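
For illustration, here is a minimal sketch of the deferral rule described in the Design paragraph above. The function and variable names are our own illustrations and are not taken from the repository's code.

```python
from typing import Optional

def combined_llm_rating(llm_ratings: list[str]) -> Optional[str]:
    """Combined-LLMs rule: keep a rating only if all LLMs agree, otherwise defer (None)."""
    return llm_ratings[0] if len(set(llm_ratings)) == 1 else None

def human_ai_rating(human_rating: str, llm_rating: str) -> Optional[str]:
    """Human-AI collaboration rule: keep the rating if human rater and LLM agree, otherwise defer (None)."""
    return human_rating if human_rating == llm_rating else None

# Hypothetical example ratings: the LLMs disagree, so the combined-LLMs rule defers.
print(combined_llm_rating(["yes", "yes", "no"]))  # None (deferred)
print(human_ai_rating("yes", "yes"))              # "yes"
```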

Contributions

Contributions and extensions (e.g. new LLMs or new evidence appraisal tools) are very welcome! However, please get in touch before starting a project so that efforts can be aligned.

Structure:

  • data contains human ratings and human consensus for each tool as well as all full-text files.
  • docs contains all LLM results and overviews; it is named docs only so that GitHub Pages can serve it directly.
  • src contains dependencies.

Adding new LLM experiments for a TOOL (e.g. precis2) generally goes like this:

  1. Create fulltext folders, e.g. data/TOOL/fulltext/pdf/txt/ (containing all plain full texts as ID.txt, e.g. data/precis2/fulltext/pdf/txt/648.txt) or data/TOOL/fulltext/pdf/png/ (containing a subfolder for every ID with a PNG for each page named ID_PAGENUM.png, e.g. data/precis2/fulltext/pdf/png/648/648_1.png etc.).
  2. Create an experiment subfolder in docs/TOOL/ with subfolders prompt_template, responses, and results, and add the new experiment to docs/TOOL/params.json (see the sketch after this list).
  3. Adjust and run 1_call_api_API.py.
  4. Adjust and run 2_extract_results_TOOL.py.
  5. Adjust and run 3_render_dashboards.r for the respective TOOL / experiment.
  6. Adjust and render docs/index.rmd.
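
As a sketch of step 2, the snippet below creates the experiment skeleton and registers it in params.json. The folder names come from this README; the structure of params.json (a top-level "experiments" list) and the example tool/experiment names are assumptions and may differ from the repository's actual format.

```python
import json
from pathlib import Path

def scaffold_experiment(tool: str, experiment: str) -> None:
    """Create docs/<TOOL>/<experiment>/{prompt_template,responses,results} and register the experiment."""
    exp_dir = Path("docs") / tool / experiment
    for sub in ("prompt_template", "responses", "results"):
        (exp_dir / sub).mkdir(parents=True, exist_ok=True)

    # Register the experiment in docs/<TOOL>/params.json; the "experiments" key is an assumption.
    params_path = Path("docs") / tool / "params.json"
    params = json.loads(params_path.read_text()) if params_path.exists() else {}
    experiments = params.setdefault("experiments", [])
    if experiment not in experiments:
        experiments.append(experiment)
    params_path.write_text(json.dumps(params, indent=2))

# Hypothetical example: scaffold a new PRECIS-2 experiment.
scaffold_experiment("precis2", "my-new-llm_txt")
```

Steps 3-6 are best adapted from the existing scripts in the repository (1_call_api_API.py, 2_extract_results_TOOL.py, 3_render_dashboards.r, docs/index.rmd).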