European Court of Human Rights OpenData construction process

This repository contains the scripts to build the database and datasets of the European Court of Human Rights OpenData (ECHR-OD) project. It serves several purposes:

  1. Reproducibility: anyone can rebuild the entire database from scratch.

  2. Extensibility: any new version of the database must be created from an updated version of these scripts.

  3. Revision: all cases are processed automatically, and there are many corner cases. This repository allows anyone to inspect the intermediate files to check whether the results are correct and to locate the root cause of parsing errors.


General information

Following the project and getting help

  • Mailing list: https://groups.google.com/g/echr-od
    Receive important announcements about the project (at most a couple of emails per year)

  • Discussion:
    Get help and discuss the project. Matrix is a decentralized, secure and open-source real-time communication system.

Citing

If you are using the project, please consider citing:

@article{ECHRDB,
  title        = {On Integrating and Classifying Legal Text Documents},
  author       = {Quemy, A. and Wrembel, R.},
  year         = 2020,
  journal      = {International Conference on Database and Expert Systems Applications (DEXA)}
}

Versioning and deployment

There are two distinct types of versions:

  1. Semantic versioning (e.g. 2.0.1), which indicates the version of the process. It relates only to the code and the type of data available.

    1. A major revision indicates a change in the type of data available.
    2. Minor and patch revisions concern bug fixes and improvements.
  2. Date of release (e.g. 2020-11-01), which indicates when a build was generated.

The database is meant to be updated every month with new cases. New releases are built upon an image created from the latest sources. Therefore, the date of release is technically enough to identify the semantic version. However, semantic versioning helps the maintainers and contributors with development.

Installation & Usage

Recreating the database requires Docker.

To build the environment image:

docker build -f Dockerfile -t echr_build .

As long as dependencies are not changed, there is no need to rebuild the image.

Once the image is built, the container help can be displayed with:

docker run -ti --mount src=$(pwd),dst=/tmp/echr_process/,type=bind echr_build -h

In particular, to build the database:

docker run -ti --mount src=$(pwd),dst=/tmp/echr_process/,type=bind echr_build build

Build, Steps & Workflow

The entry point of the extract-transform-load (ETL) process is build.py.
The different ETL steps can be found in the subfolder echr/steps.

The main build script loads a workflow made of steps and executes each of them in turn. Workflows are YAML files and can be found in the folder workflows.
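
As a rough illustration of this load-and-execute loop, the sketch below runs a workflow that has already been parsed into a list of step entries. The step names, the registry mapping and the `run_workflow` helper are hypothetical and are not taken from build.py; the actual script may organize its steps differently.

```python
from typing import Callable, Dict, List

# Hypothetical type: a step is a callable receiving its parameters as a dict.
StepFn = Callable[[dict], None]


def run_workflow(steps: List[dict], registry: Dict[str, StepFn]) -> List[str]:
    """Run each step of a parsed workflow in order; return the names executed."""
    executed = []
    for step in steps:
        name = step["name"]
        if name not in registry:
            raise KeyError(f"Unknown step: {name}")
        # Pass the step's own parameters (if any) to the step function.
        registry[name](step.get("params", {}))
        executed.append(name)
    return executed


if __name__ == "__main__":
    # A workflow as it might look once its YAML file is loaded (illustrative only).
    workflow = [
        {"name": "fetch", "params": {"max_documents": 10}},
        {"name": "parse"},
    ]
    registry = {
        "fetch": lambda params: print("fetch", params),
        "parse": lambda params: print("parse", params),
    }
    run_workflow(workflow, registry)
```

Failing fast on an unknown step name, as assumed here, keeps a typo in a workflow file from being silently skipped.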

The workflows provided with the project are:

  • Local (local.yml): full ETL build locally,
  • Release (release.yml): full ETL including deployment to the server,
  • Database (database.yml): build the database only (no NLP model, no datasets),
  • Datasets (datasets.yml): build the datasets only (does not generate the database),
  • NLP Model (NLP_model.yml): build only the NLP model,
  • Runner (runner.yml): execute a workflow on an external runner.

We have the following relations:

  • Datasets = NLP Model + datasets generation step
  • Local = Database + Datasets
  • Release = Local + deployment step

This separation has been made because generating the NLP model takes up 95% of the whole Release workflow time and requires a large amount of RAM (>16 GB).
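
The relations between workflows can be pictured as simple list composition. This is only a sketch: the step names are hypothetical placeholders, not the identifiers used in the project's YAML files.

```python
# Step names below are hypothetical placeholders, not the project's real step identifiers.
DATABASE = ["build_database"]
NLP_MODEL = ["build_nlp_model"]

# Datasets = NLP Model + datasets generation step
DATASETS = NLP_MODEL + ["generate_datasets"]
# Local = Database + Datasets
LOCAL = DATABASE + DATASETS
# Release = Local + deployment step
RELEASE = LOCAL + ["deploy"]

print(RELEASE)
```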

Workflows may define variables using uppercase names starting with $ (e.g. $MAX_DOCUMENTS). The variables are replaced during the build process using the following order of priority:

  1. Environment variable
  2. CLI parameter
  3. Configuration file, under build.env
  4. Global variable defined in build.py
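
The priority order above can be sketched as a lookup that stops at the first source defining the variable. The function names, the regular expression and the choice to leave unknown variables untouched are assumptions made for illustration; they do not describe the actual code in build.py.

```python
import os
import re
from typing import Mapping, Optional

# Matches uppercase variable names prefixed with "$", e.g. $MAX_DOCUMENTS.
VARIABLE_PATTERN = re.compile(r"\$([A-Z][A-Z0-9_]*)")


def resolve_variable(name: str, cli_params: Mapping, config_env: Mapping,
                     defaults: Mapping, environ: Optional[Mapping] = None) -> Optional[str]:
    """Return the value from the highest-priority source that defines `name`."""
    environ = os.environ if environ is None else environ
    # Priority: environment > CLI parameters > config build.env > global defaults.
    for source in (environ, cli_params, config_env, defaults):
        if name in source:
            return str(source[name])
    return None


def substitute(text: str, cli_params: Mapping, config_env: Mapping,
               defaults: Mapping, environ: Optional[Mapping] = None) -> str:
    """Replace every $VARIABLE in `text`; unknown variables are left as-is (an assumption)."""
    def replace(match: "re.Match") -> str:
        value = resolve_variable(match.group(1), cli_params, config_env, defaults, environ)
        return match.group(0) if value is None else value

    return VARIABLE_PATTERN.sub(replace, text)


if __name__ == "__main__":
    # The CLI value wins here because the (empty) environment does not define the variable.
    print(substitute("limit=$MAX_DOCUMENTS", {"MAX_DOCUMENTS": 10},
                     {"MAX_DOCUMENTS": 20}, {}, environ={}))
```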

Configuration

The general configuration file is config.yml and contains three parts:

  1. logging: settings for the log files

  2. steps: configuration for each step, applied on top of the workflow

  3. build: build-specific configuration; in particular, the env section contains the variables available to the whole workflow

Logs

There are two logs:

  1. The build log: build/<build>/logs/build.html and build/<build>/logs/build.txt
  2. The process log, mostly used for debugging: logs/build.log

Tests & Coverage

To run the tests:

docker run -ti --mount src=$(pwd),dst=/tmp/echr_process/,type=bind echr_build test

Versions

Contributors