- This is the official repository for the KBL dataset from LBox.
- The work will be presented at EMNLP 2024 Findings and the NLLP Workshop.
- The paper is available from here.
- Released the benchmark on the Hugging Face Hub.
  - To access the dataset for RAG tasks, visit here.
- Release the Korean statutes and precedents corpus for RAG experiments.
- Release the RAG task `yaml` files.
  - Currently, due to technical difficulties, evaluating LLMs under the RAG setting is possible with the given retrieved documents, using a custom branch of `lm-eval-harness`.
  - LRAGE, a RAG evaluation tool specifically tailored for the legal domain, is under active development. Full features are expected around Dec 15, 2024. Please check the tool from here.
- Make the yaml files and corresponding utils available in the `lm-evaluation-harness` repository.
- Share the data processing script for RAG experiments.
- Present the paper at EMNLP 2024.
- Release yaml files for `multiple_choice` type evaluations.
```python
from pprint import pprint

import datasets

# Path of a task file inside the lbox/kbl repository (one example file is shown here).
FILE_PATH = "knowledge/kbl_legal_concept_qa_v0.1.json"

# load_dataset returns a DatasetDict, so select the "test" split before indexing rows.
data = datasets.load_dataset("lbox/kbl", data_files={"test": [FILE_PATH]})["test"]
pprint(data[0])
```
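
Field names can differ between task files, so it may help to summarize the fields of a split before writing an evaluation loop. The following is a minimal sketch, not part of the official tooling: `summarize_fields` is a hypothetical helper, and the sample records (including the `question`, `A`, `B`, and `gt` keys) are made-up placeholders rather than the actual KBL schema.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def summarize_fields(records: Iterable[Dict[str, Any]]) -> Dict[str, Counter]:
    """Map each field name to a Counter of the Python type names seen for it."""
    summary: Dict[str, Counter] = {}
    for record in records:
        for key, value in record.items():
            summary.setdefault(key, Counter())[type(value).__name__] += 1
    return summary


# Hypothetical records standing in for rows of a loaded split.
sample_records = [
    {"question": "Q1", "A": "a", "B": "b", "gt": "A"},
    {"question": "Q2", "A": "c", "B": "d", "gt": "B"},
]

field_summary = summarize_fields(sample_records)
for name, type_counts in field_summary.items():
    print(name, dict(type_counts))
```

Since iterating over a loaded split yields one dictionary per row, the same helper should also accept the `data` object from the loading snippet, for example `summarize_fields(data)`.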
- Korean statutes (220,160 articles, dumped in November 2024)
- Korean precedents (from LBox-Open)
```python
from pprint import pprint

import datasets

# Load statutes corpus
data = datasets.load_dataset('lbox/kbl-rag', data_files={"train": "corpus/statutes.jsonl"})["train"]

# Load precedents corpus
# data = datasets.load_dataset('lbox/kbl', data_files={"train": "corpus/precedents.jsonl"})["train"]

# Load precedents and statutes corpus
# data = datasets.load_dataset('lbox/kbl', data_files={"train": "corpus/precedents_and_statutes.jsonl"})["train"]

pprint(data[0])
```
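
To illustrate how a corpus like this could feed a RAG experiment, here is a minimal sketch of a lexical retriever that ranks documents against a query with a simple TF-IDF overlap score. It is an illustration only and is not the retriever used in the paper or in `lm-eval-harness`. The helpers `tokenize` and `rank_documents` are hypothetical, the sample documents are invented English placeholders, and the crude regex tokenizer would likely need to be replaced by a proper Korean tokenizer for real use.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-word characters (a crude stand-in for a real tokenizer)."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


def rank_documents(query: str, documents: List[str], top_k: int = 2) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, best first, using a TF-IDF overlap score."""
    doc_tokens = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)

    # Document frequency: the number of documents that contain each token.
    doc_freq: Counter = Counter()
    for counts in doc_tokens:
        doc_freq.update(counts.keys())

    query_tokens = set(tokenize(query))
    scores = []
    for idx, counts in enumerate(doc_tokens):
        score = 0.0
        for tok in query_tokens:
            if tok in counts:
                # Smoothed inverse document frequency; rarer tokens weigh more.
                idf = math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0
                score += counts[tok] * idf
        scores.append((idx, score))

    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Invented placeholder documents; real ones would come from the loaded corpus.
sample_documents = [
    "Article 1: A contract is formed by offer and acceptance.",
    "Article 2: Theft is punishable by imprisonment.",
    "Article 3: A lease contract may be terminated with notice.",
]

ranked = rank_documents("When is a contract formed?", sample_documents, top_k=2)
for doc_index, score in ranked:
    print(round(score, 3), sample_documents[doc_index])
```

With the real corpus, `documents` would be built from whichever field holds the article or precedent text. That field name is not assumed here and should be checked against the output of `pprint(data[0])`.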
```bibtex
@inproceedings{kim2024kbl,
    title = "Developing a Pragmatic Benchmark for Assessing {K}orean Legal Language Understanding in Large Language Models",
    author = {Yeeun Kim and Young Rok Choi and Eunkyung Choi and Jinhwan Choi and Hai Jin Park and Wonseok Hwang},
    editor = "Al-Onaizan, Yaser and
      Bansal, Mohit and
      Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.319",
    pages = "5573--5595",
}
```