Code for the paper "LEDGAR: A Large-Scale Multilabel Corpus for Text Classification of Legal Provisions in Contracts"

Link to the paper: https://www.aclweb.org/anthology/2020.lrec-1.155.pdf

Requirements:

Python 3.7+
To install the requirements: pip install -r requirements.txt

Data:

The full corpus as a zipped jsonl file is located here.
For the MLP (+Attention) classification experiments you will also need pretrained MUSE embeddings from here.
Prepare the word embeddings: python convert_embedding_txt.py /path/to/wiki.multi.en.vec (this creates wiki.multi.en.vec_data.npy and wiki.multi.en.vec_vocab.json in the same folder).
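For orientation, here is a minimal sketch of what reading a MUSE .vec file involves. It is not the code in convert_embedding_txt.py. It assumes the usual fastText-style text layout (a "<num_words> <dim>" header line, then one token followed by its floats per line), and the function name parse_vec_file and the token-to-row-index mapping are illustrative assumptions. The sketch keeps everything in memory with the standard library only; it does not write the .npy and .json files that the real script produces.

```python
import os
import tempfile


def parse_vec_file(path):
    """Parse a fastText-style .vec text file into (vocab, vectors).

    Assumed layout (not confirmed against the repository's script):
    an optional header line "<num_words> <dim>", then one line per word
    holding the token followed by its space-separated float values.
    """
    vocab = {}    # token -> row index (hypothetical mapping)
    vectors = []  # row index -> list of floats
    with open(path, encoding="utf-8") as handle:
        lines = [line.rstrip() for line in handle if line.strip()]
    # A first line with exactly two integer fields is treated as the header.
    if lines:
        first = lines[0].split(" ")
        if len(first) == 2 and all(part.isdigit() for part in first):
            lines = lines[1:]
    for line in lines:
        token, *values = line.split(" ")
        if token in vocab:
            continue  # keep the first occurrence of a duplicated token
        vocab[token] = len(vectors)
        vectors.append([float(value) for value in values])
    return vocab, vectors


# Toy demonstration with a made-up two-word, three-dimensional file.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.vec")
    with open(toy_path, "w", encoding="utf-8") as handle:
        handle.write("2 3\nthe 0.1 0.2 0.3\ncontract 0.4 0.5 0.6\n")
    toy_vocab, toy_vectors = parse_vec_file(toy_path)

print(toy_vocab)
print(len(toy_vectors), "vectors of dimension", len(toy_vectors[0]))
```

Splitting the data into a vocabulary and a plain matrix of floats is what makes it natural to store one part as JSON and the other as a NumPy array, which matches the two output files named above.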

Usage:

Creating the different sub-corpora: python corpus_analysis_and_sampling.py /path/to/LEDGAR_2016-2019_clean.jsonl
To run the classification baselines, navigate to the classification subfolder: cd classification
Logistic Regression: python classification_baselines.py /path/to/sub-corpus.jsonl
MLP: python mlp_classifier.py /path/to/sub-corpus.jsonl /path/to/wiki.multi.en.vec_data.npy /path/to/wiki.multi.en.vec_vocab.json
MLP + Attention: python mlp_classifier_attention.py /path/to/sub-corpus.jsonl /path/to/wiki.multi.en.vec_data.npy /path/to/wiki.multi.en.vec_vocab.json
DistilBERT: python distilbert_baseline.py --data /path/to/sub-corpus.jsonl --mode train (for more detailed instructions, consult this readme).
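Before running any of the commands above, it can help to look at the label distribution of a corpus file. The following is a minimal sketch, not the logic in corpus_analysis_and_sampling.py. It assumes each JSONL line is an object with a "provision" string and a "label" list; those field names, the helper names, and the min_count threshold are assumptions made for illustration, and the three sample records are invented.

```python
import json
from collections import Counter


def read_jsonl(lines):
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def label_counts(records, label_key="label"):
    """Count how often each label occurs across all records."""
    counts = Counter()
    for record in records:
        counts.update(record.get(label_key, []))
    return counts


def keep_frequent_labels(records, min_count, label_key="label"):
    """Drop labels seen fewer than min_count times, then drop records
    that are left without any label. Hypothetical filter for illustration."""
    counts = label_counts(records, label_key)
    kept = []
    for record in records:
        labels = [lab for lab in record.get(label_key, []) if counts[lab] >= min_count]
        if labels:
            kept.append({**record, label_key: labels})
    return kept


# Invented sample lines; a real run would iterate over an opened .jsonl file.
sample_lines = [
    '{"provision": "This Agreement is governed by the laws of New York.", "label": ["governing laws"]}',
    '{"provision": "All notices must be given in writing.", "label": ["notices"]}',
    '{"provision": "New York law applies and its courts have jurisdiction.", "label": ["governing laws", "jurisdiction"]}',
]

records = list(read_jsonl(sample_lines))
print(label_counts(records).most_common())
print(len(keep_frequent_labels(records, min_count=2)), "records kept")
```

To apply the same functions to a file on disk, pass the open file handle to read_jsonl, for example: with open("/path/to/sub-corpus.jsonl", encoding="utf-8") as handle: records = list(read_jsonl(handle)).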

Citation:

If you use the corpus, please cite our paper:

@inproceedings{tuggener2020ledgar,
  title={LEDGAR: a large-scale multi-label corpus for text classification of legal provisions in contracts},
  author={Tuggener, Don and von D{\"a}niken, Pius and Peetz, Thomas and Cieliebak, Mark},
  booktitle={12th Language Resources and Evaluation Conference (LREC) 2020},
  pages={1228--1234},
  year={2020},
  organization={European Language Resources Association}
}


dtuggener/LEDGAR_provision_classification

Folders and files

Latest commit

History

Repository files navigation

Code for the paper "LEDGAR: A Large-Scale Multilabel Corpus for Text Classification of Legal Provisions in Contracts"

Requirements:

Data:

Usage:

Citation:

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages