Tim Woelfle, Julian Hirt, Perrine Janiaud, Ludwig Kappos, John Ioannidis, Lars G Hemkens
Background: It is unknown whether large language models (LLMs) can facilitate the time- and resource-intensive text-related processes of evidence appraisal. Objectives: To quantify the agreement of LLMs with human consensus in the appraisal of scientific reporting (PRISMA) and methodological rigor (AMSTAR) of systematic reviews and of the design of clinical trials (PRECIS-2). To identify areas where human-AI collaboration would outperform the traditional consensus process of human raters in efficiency.
Design: Five LLMs (Claude-3-Opus, Claude-2, GPT-4, GPT-3.5, Mixtral-8x22B) assessed 112 systematic reviews applying the PRISMA and AMSTAR criteria, and 56 randomized controlled trials applying PRECIS-2. We quantified agreement between human consensus and (1) individual human raters; (2) individual LLMs; (3) a combined-LLMs approach; (4) human-AI collaboration. Ratings were marked as deferred (undecided) in cases of inconsistency between the combined LLMs or between the human rater and the LLM.
Results: Individual human rater accuracy was 89% for PRISMA and AMSTAR, and 75% for PRECIS-2. Individual LLM accuracy ranged from 63% (GPT-3.5) to 70% (Claude-3-Opus) for PRISMA, 53% (GPT-3.5) to 74% (Claude-3-Opus) for AMSTAR, and 38% (GPT-4) to 55% (GPT-3.5) for PRECIS-2. Combined LLM ratings led to accuracies of 75-88% for PRISMA (4-74% deferred), 74-89% for AMSTAR (6-84% deferred), and 64-79% for PRECIS-2 (29-88% deferred). Human-AI collaboration resulted in the best accuracies: 89-96% for PRISMA (25/35% deferred), 91-95% for AMSTAR (27/30% deferred), and 80-86% for PRECIS-2 (76/71% deferred).
Conclusions: Current LLMs alone appraised evidence worse than humans. Human-AI collaboration may reduce the workload for the second human rater in the assessment of reporting (PRISMA) and methodological rigor (AMSTAR), but not for complex tasks such as PRECIS-2.
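
To make the deferral rule concrete, here is a minimal Python sketch (illustrative only, not the repository's analysis code; the function names and toy data are made up): a combined rating is kept when the two raters, whether two LLMs or a human and an LLM, agree, and deferred otherwise.

```python
def combine_ratings(rating_a, rating_b):
    """Return the shared rating if both raters agree, else None (deferred)."""
    return rating_a if rating_a == rating_b else None


def accuracy_and_deferral(ratings_a, ratings_b, consensus):
    """Accuracy on non-deferred items and the share of deferred items."""
    combined = [combine_ratings(a, b) for a, b in zip(ratings_a, ratings_b)]
    decided = [(c, ref) for c, ref in zip(combined, consensus) if c is not None]
    deferred_share = 1 - len(decided) / len(combined)
    accuracy = sum(c == ref for c, ref in decided) / len(decided) if decided else float("nan")
    return accuracy, deferred_share


# Toy example: two raters disagree on one of four items, so it is deferred.
llm_a = ["yes", "no", "yes", "no"]
llm_b = ["yes", "yes", "yes", "no"]
human_consensus = ["yes", "no", "no", "no"]
print(accuracy_and_deferral(llm_a, llm_b, human_consensus))  # (0.666..., 0.25)
```
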
Contributions and extensions (e.g. new LLMs or new evidence appraisal tools) are very welcome! However, please get in touch before starting a project so we can align.
Structure:
- `data` contains human ratings and human consensus for each tool as well as all full text files.
- `docs` contains all LLM results and overviews and is called `docs` only so that GitHub Pages routes directly to it.
- `src` contains dependencies.
Adding new LLM experiments for a TOOL (e.g. `precis2`) generally goes like this:

- Create fulltext folders, e.g. `data/TOOL/fulltext/pdf/txt/` (containing all plain full texts as `ID.txt`, e.g. `data/precis2/fulltext/pdf/648.txt`) or `data/TOOL/fulltext/pdf/png/` (containing a subfolder for every ID with a png for each page named `ID_PAGENUM.png`, e.g. `data/precis2/fulltext/pdf/png/648/648_1.png` etc.).
- Create an experiment subfolder in `docs/TOOL/` with subfolders `prompt_template`, `responses`, and `results`. Add the new experiment to `docs/TOOL/params.json`.
- Adjust and run `1_call_api_API.py` (see the sketch after this list).
- Adjust and run `2_extract_results_TOOL.py`.
- Adjust and run `3_render_dashboards.r` for the respective TOOL / experiment.
- Adjust and render `docs/index.rmd`.
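
For orientation, the following is a minimal sketch of what the API-call step (`1_call_api_API.py`) might look like; it is not the repository's actual script. The experiment name, the prompt-template file name (`prompt.txt`), the `{fulltext}` placeholder, the model choice, and the use of the OpenAI Python client are assumptions for the example.

```python
from pathlib import Path

from openai import OpenAI  # assumed provider SDK; other providers would use their own clients

client = OpenAI()  # expects OPENAI_API_KEY in the environment

TOOL = "precis2"
EXPERIMENT = "my_experiment"  # hypothetical experiment subfolder in docs/TOOL/
MODEL = "gpt-4"

# Hypothetical template file containing a {fulltext} placeholder
template = Path(f"docs/{TOOL}/{EXPERIMENT}/prompt_template/prompt.txt").read_text()

for txt_file in sorted(Path(f"data/{TOOL}/fulltext/pdf/txt").glob("*.txt")):
    prompt = template.format(fulltext=txt_file.read_text())
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    # Store the raw response so the extraction step can parse it later
    out_path = Path(f"docs/{TOOL}/{EXPERIMENT}/responses/{txt_file.stem}.txt")
    out_path.write_text(response.choices[0].message.content)
```
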