Code for the paper "LEDGAR: A Large-Scale Multilabel Corpus for Text Classification of Legal Provisions in Contracts"
Link to the paper: https://www.aclweb.org/anthology/2020.lrec-1.155.pdf
- Python 3.7+
- To install the requirements:
pip install -r requirements.txt
- The full corpus is available as a zipped JSONL file here.
- For the MLP (+Attention) classification experiments, you will also need pretrained MUSE embeddings from here.
- Prepare the word embeddings:
python convert_embedding_txt.py /path/to/wiki.multi.en.vec
This will create wiki.multi.en.vec_data.npy and wiki.multi.en.vec_vocab.json in the same folder.
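The two generated files can be loaded back with NumPy and the standard `json` module. The sketch below assumes that the `.npy` file holds a 2-D array with one row per word, and that the vocabulary JSON is either a list of words in row order or an object mapping each word to its row index. Neither layout is documented here, so check the output of `convert_embedding_txt.py` before relying on it. The function name `load_embeddings` is hypothetical and not part of this repository.

```python
import json

import numpy as np


def load_embeddings(data_path, vocab_path):
    """Load an embedding matrix and a word -> row index mapping."""
    matrix = np.load(data_path)
    with open(vocab_path, encoding="utf-8") as handle:
        vocab = json.load(handle)
    # Assumption: a list means "words in row order"; turn it into a mapping.
    if isinstance(vocab, list):
        vocab = {word: index for index, word in enumerate(vocab)}
    return matrix, vocab
```

With both files loaded, `matrix[vocab[word]]` would give the vector for `word`, provided the assumed layout holds.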
- Create the different sub-corpora:
python corpus_analysis_and_sampling.py /path/to/LEDGAR_2016-2019_clean.jsonl
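The corpus and the sub-corpora are JSONL files, so each line should hold one JSON record. A minimal sketch for inspecting such a file follows. It assumes only that every record is a JSON object and makes no assumption about field names. The helpers `iter_jsonl` and `count_keys` are hypothetical and not part of this repository.

```python
import json
from collections import Counter


def iter_jsonl(path):
    """Yield one parsed JSON record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_keys(path):
    """Count how often each top-level key occurs across all records."""
    counts = Counter()
    for record in iter_jsonl(path):
        counts.update(record.keys())
    return counts
```

Calling `count_keys("/path/to/sub-corpus.jsonl")` is a quick way to see which fields the records actually contain before running the baselines.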
- To run the classification baselines, navigate to the classification subfolder:
cd classification
- Logistic Regression:
python classification_baselines.py /path/to/sub-corpus.jsonl
- MLP:
python mlp_classifier.py /path/to/sub-corpus.jsonl /path/to/wiki.multi.en.vec_data.npy /path/to/wiki.multi.en.vec_vocab.json
- MLP + Attention:
python mlp_classifier_attention.py /path/to/sub-corpus.jsonl /path/to/wiki.multi.en.vec_data.npy /path/to/wiki.multi.en.vec_vocab.json
- DistilBert:
python distilbert_baseline.py --data /path/to/sub-corpus.jsonl --mode train
For more detailed instructions, consult this readme.
Thanks for citing our paper should you use the corpus!
@inproceedings{tuggener2020ledgar,
title={LEDGAR: a large-scale multi-label corpus for text classification of legal provisions in contracts},
author={Tuggener, Don and von D{\"a}niken, Pius and Peetz, Thomas and Cieliebak, Mark},
booktitle={12th Language Resources and Evaluation Conference (LREC) 2020},
pages={1228--1234},
year={2020},
organization={European Language Resources Association}
}