Unsupervised Extraction of Workplace Rights and Duties from Collective Bargaining Agreements

This is the (partial) code base for our paper Unsupervised Extraction of Workplace Rights and Duties from Collective Bargaining Agreements. For questions regarding the code base, please contact [email protected].

The code base covers:

  • Parsing articles
  • Computing authority scores at the statement level
  • Aggregating authority scores at the contract level

The main output files of the pipeline are:

  • $output_directory/04_auth.pkl (contains all entitlements/obligations/constraints/permissions at the clause level)
  • $output_directory/05_aggregated.pkl (contains scores aggregated per subject at the contract level)

The code base does not cover:

  • Splitting contracts into articles (we developed a customized solution for converting PDFs to text and splitting labor union contracts; it is domain-specific and does not translate well to other contracts)
  • Parallelized parsing (the computational bottleneck for large collections is spaCy dependency parsing, which we computed in parallel using Linux command-line tools and 96 machines)
  • Our analysis, which is also heavily customized for Canadian union contracts and does not necessarily translate well to other domains (e.g. clustering on article headers and training LDA)

Update December 2023

  • Added aggregation of scores per subject at the contract level
  • Refactored how entitlements are computed in src/main04_compute_auth.py

Installation

Assuming Anaconda and Linux, the environment can be set up with the following commands:

conda create -n labor-contracts python=3.6
conda activate labor-contracts
pip install -r requirements.txt

Installing spaCy and linking with neuralcoref

Installing spaCy and linking it with neuralcoref does not work out of the box on Linux; the following steps eventually worked:

git clone https://github.com/huggingface/neuralcoref.git
cd neuralcoref
pip install -r requirements.txt
pip install -e .
pip install spacy==2.3.2
python -m spacy download en

Run the pipeline

The pipeline accepts two types of input. In both cases, the input is a directory containing one file per contract, either as a .txt file or as a .json file.

  • If the input is JSON, each contract should already be split into "articles" and contain a contract_id (see the sketch after this list).
  • Otherwise, rights and duties are computed at the contract level, and the filename becomes the article_id.
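
For the JSON case, here is a minimal sketch of what an input file could look like, based on the description above (everything beyond the contract_id field and the article split is an assumption; check src/pipeline.py for the schema actually expected):

import json

# Hypothetical input contract, pre-split into articles. Field names other
# than contract_id are illustrative and not confirmed by the code base.
contract = {
    "contract_id": "contract_0001",
    "articles": {
        "article_01": "The Employer shall provide all safety equipment ...",
        "article_02": "An employee may request unpaid leave ...",
    },
}

with open("sample_data/contract_0001.json", "w") as f:
    json.dump(contract, f, indent=2)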

Output will be stored in output_directory; the main output there is the file 04_auth.pkl. For each subject-verb tuple, it contains boolean values indicating whether the tuple is an entitlement, an obligation, etc., and records the "role" of the subject (worker, firm, etc.). These results can then be aggregated at any desired level (see the sketch after the run command below). Intermediate pipeline steps are saved in the output directory as well.

input_directory="sample_data"
output_directory="output_sample_data"
python src/pipeline.py --input_directory $input_directory --output_directory $output_directory
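
After a run, 04_auth.pkl can be inspected directly. A minimal sketch, assuming the pickle deserializes to a pandas DataFrame with one row per subject-verb tuple and boolean columns per authority measure (the column names below are hypothetical; adjust them to whatever src/main04_compute_auth.py actually writes):

import pandas as pd

# Load the clause-level results produced by the sample run above.
auth = pd.read_pickle("output_sample_data/04_auth.pkl")

# Share of each authority measure per contract and subject role
# (hypothetical column names).
measures = ["entitlement", "obligation", "constraint", "permission"]
shares = auth.groupby(["contract_id", "subject_role"])[measures].mean()
print(shares.head())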

Our results in the paper are computed with coreferences resolved. However, setting up neuralcoref with spaCy in 2022 is somewhat cumbersome, and the whole process is computationally expensive. If coreference resolution is desired, install neuralcoref as described above and run the pipeline with the flag --use_neural_coref:

input_directory="path/to/directory/with/contracts"
output_directory="authority_measures"
python src/pipeline.py --use_neural_coref --input_directory $input_directory --output_directory $output_directory

What probably needs to be customized for other contract collections

  • We were interested in very specific roles, e.g. worker and firm. Role assignment is a simple dictionary lookup on the subject of a clause; for example, the following words are considered worker: employee, worker, staff, teacher, nurse, mechanic, operator, steward, personnel. Overwrite these lists in src/main04_compute_auth.py for customized applications (a sketch follows below).
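
A minimal sketch of that kind of override, assuming the lookup is a plain word-to-role dictionary in src/main04_compute_auth.py (the variable name ROLE_LEXICON and the firm word list are hypothetical):

# Hypothetical structure of the role lookup; the actual variable name and
# shape in src/main04_compute_auth.py may differ.
ROLE_LEXICON = {
    "worker": "employee,worker,staff,teacher,nurse,mechanic,operator,steward,personnel".split(","),
    "firm": "employer,company,management".split(","),  # illustrative list
}

def role_of(subject: str) -> str:
    # Dictionary lookup on the lower-cased subject of a clause.
    word = subject.lower()
    for role, words in ROLE_LEXICON.items():
        if word in words:
            return role
    return "other"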
