This article explores the potential applications and implications of large language models (LLMs) for historical research and pedagogy. It examines the capabilities of GPT-4 and other machine learning models through case studies that assess their utility in making historical sources more accessible. GPT-4's performance is evaluated on a series of prompted tasks, including data preparation, source analysis, and the ethical implications of simulating historical worldviews. GPT-4's proficiency in historical knowledge is also evaluated using a widely recognized machine learning benchmark. A replication study demonstrates that GPT-4 exhibits expert-level performance in three distinct historical subfields. Given the rapid advances in LLMs, historians should contribute to wider debates surrounding these technologies, as the unpredictable impacts of democratized AI on historical knowledge are already emerging.
In the article's hermeneutical layer, the author explores the practice of prompt engineering: techniques for using natural language instructions to guide an LLM's output. Prompt engineering strategies are demonstrated through few-shot prompting, chain-of-thought reasoning, and prompt chaining.
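The sketch below illustrates the three strategies in Python. It is a minimal illustration rather than code from the article: it assumes the openai Python client with an API key in the environment, the `ask` helper is hypothetical shorthand for a single chat-completion call, and the historical prompts are invented examples.

```python
# A minimal sketch of the three prompt-engineering strategies named above,
# assuming the openai Python client (pip install openai) and an
# OPENAI_API_KEY in the environment. Prompts are illustrative placeholders,
# not examples drawn from the article itself.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Send a single-turn prompt to GPT-4 and return the text of its reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# 1. Few-shot prompting: worked examples inside the prompt steer the
#    model toward the desired output format.
few_shot = ask(
    "Normalize the spelling of early modern English place names.\n"
    "Lundon -> London\n"
    "Yorke -> York\n"
    "Glocester ->"
)

# 2. Chain-of-thought reasoning: asking for intermediate steps before
#    the final answer.
chain_of_thought = ask(
    "A parish register lists a baptism dated 1752 in 'old style' dating. "
    "Think step by step about the English calendar reform before giving "
    "the equivalent 'new style' year."
)

# 3. Prompt chaining: feed one prompt's output into the next prompt.
transcription = ask(
    "Render this OCR'd passage in modern spelling: 'Ye olde shoppe...'"
)
summary = ask(
    f"Summarize the following transcription in one sentence:\n{transcription}"
)

print(few_shot, chain_of_thought, summary, sep="\n---\n")
```

One appeal of chaining, as the sketch suggests, is that it decomposes a workflow into single-purpose prompts whose intermediate outputs can be inspected, which suits multi-step tasks such as source analysis.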
Large language models, GPT-4, artificial intelligence, machine learning, historical methodology, optical character recognition, oral history, prompt engineering