This resource documents the design and operation of the Judgment Enrichment Pipeline (JEP) built for The National Archives by MDRxTECH and vLex Justis to support the publishing process that sits behind the Find Case Law platform.
The primary purpose of the JEP is to "enrich" the judgments published on Find Case Law by marking up important pieces of legal information - such as references to earlier cases and legislation - cited in the body of the judgment. In certain scenarios described elsewhere in this documentation, the JEP will "repair" or resolve entities that are malformed whilst respecting the original text of the judgment in question.
At its core, the JEP is a series of serverless functions, which we call Annotators, that sequentially add layers of markup to judgments submitted for enrichment. Each Annotator is responsible for performing a specific type of enrichment. For example, the Case Law Annotator detects references to case law citations (such as `[2021] 1 WLR 1`) and the Legislation Annotator marks up mentions of UK primary legislation. An overview of the Annotators can be found below, with more detailed notes on each set out in dedicated documentation in this folder.
A comprehensive map of the JEP's architecture can be found here
The Annotators are supported by a cast of utility functions that are responsible for ETL, XML validation, rules and data management, and file manipulation. The most important of these utility functions are the Replacers, which generate the enriched XML that is sent back for publication on Find Case Law.
A significant amount of core markup annotation is provided directly by the JEP, but it is also supported by an integration with the vLex vCite engine. vCite extends the JEP's functionality in a range of ways, including the addition of a comprehensive suite of case law citation matchers. See here for more detail on the vCite integration and how it is controlled.
An example enriched snippet of LegalDocML featuring case law citation markup looks like this:
```xml
<ref href="https://caselaw.nationalarchives.gov.uk/ewca/civ/2021/1308" uk:canonical="[2021] EWCA Civ 1308" uk:isneutral="true" uk:type="case" uk:year="2021" uk:origin="TNA">[2021] EWCA Civ 1308</ref>, <ref href="#" uk:canonical="[2022] 1 WLR 1585" uk:isneutral="false" uk:type="case" uk:year="2022" uk:origin="TNA">[2022] 1 WLR 1585</ref>
```
The JEP is a modular system comprising a series of AWS Lambda functions -- the Annotators -- that are each responsible for performing a discrete step in the enrichment pipeline. The five Annotator functions are:
- Case Law Annotator -- detects references to UK case law citations, such as `[2022] 1 WLR 123`
- Legislation Annotator -- detects references to UK primary legislation, such as `Theft Act 1968`
- Abbreviation Annotator -- detects abbreviations and resolves them to their longform. For example, the longform of `HRA 1998` is `Human Rights Act 1998`
- Oblique Legislative References Annotator -- detects indirect references to primary legislation, such as `the Act` or `the 1998 Act`, and determines which cited primary enactment the indirect reference corresponds to
- Legislative Provision Annotator -- identifies references to legislative provisions, such as `section 6`, and identifies the corresponding primary enactment, for example `section 6 of the Human Rights Act`
There are four phases of enrichment. Each phase adds its own layer of markup to the LegalDocML produced by the phase before it.
- First phase -- the Case Law Annotator, the Legislation Annotator and the Abbreviation Annotator
- Second phase -- the Oblique Legislative References Annotator
- Third phase -- the Legislative Provision Annotator
- Fourth phase -- the vCite integration
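The phased flow can be sketched as a simple sequential pipeline. This is a minimal illustration only: in production each phase is one or more AWS Lambda functions, and the function bodies below are placeholders, not the real Annotator implementations.

```python
# Illustrative sketch of the four-phase enrichment flow. Each phase takes
# LegalDocML text and returns a progressively enriched version of it.

def first_phase(xml: str) -> str:
    # Case Law, Legislation and Abbreviation Annotators run here
    return xml

def second_phase(xml: str) -> str:
    # Oblique Legislative References Annotator runs here
    return xml

def third_phase(xml: str) -> str:
    # Legislative Provision Annotator runs here
    return xml

def fourth_phase(xml: str) -> str:
    # vCite integration runs here
    return xml

def enrich(xml: str) -> str:
    """Apply the four enrichment phases in order."""
    for phase in (first_phase, second_phase, third_phase, fourth_phase):
        xml = phase(xml)
    return xml
```

The key property this sketch captures is that the output of each phase is the input to the next, so later phases can rely on markup added earlier (for example, the Oblique Legislative References Annotator relies on legislation already marked up in the first phase).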
The Replacers are responsible for registering the various entities detected by the Annotators, including their entity types and position in the judgment body. The registered replacements are then applied to the judgment body through a series of string manipulations by the `make_replacements` lambda.
There are two sets of replacer logic. The first set provides the logic for first phase enrichment replacements. The second set of replacer logic handles replacement in the second and third phases of enrichment.
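A minimal sketch of how registered replacements might be applied by string manipulation. The tuple layout, function name and right-to-left application order are assumptions for illustration, not the actual `make_replacements` code:

```python
from typing import List, Tuple

# Each registered replacement records the span of a detected entity in the
# judgment body and the enriched markup that should replace it.
Replacement = Tuple[int, int, str]  # (start, end, replacement_markup)

def make_replacements(body: str, replacements: List[Replacement]) -> str:
    """Apply registered replacements by string manipulation.

    Spans are applied from the end of the text backwards so that earlier
    offsets are not invalidated as the string grows.
    """
    for start, end, markup in sorted(replacements, reverse=True):
        body = body[:start] + markup + body[end:]
    return body

body = "See [2022] 1 WLR 1585 for details."
enriched = make_replacements(
    body,
    [(4, 21, '<ref uk:type="case">[2022] 1 WLR 1585</ref>')],
)
```

Applying spans in reverse offset order is a common trick when splicing markup into plain text: each insertion lengthens the string, so working backwards keeps every remaining offset valid.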
It is possible for the same judgment to be submitted for enrichment on multiple occasions, which creates the risk that existing enrichment present in the judgment will break as additional enrichment is added. To address this, the JEP "sanitises" the judgment body prior to making replacements. The sanitisation step simply strips existing `<ref>` markup from the judgment. This logic is handled in the `make_replacements` lambda.
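The sanitisation step can be illustrated with a simple regex-based sketch. This is not the actual `make_replacements` code; it only shows the idea of removing the tags while preserving the original citation text so that enrichment can be re-applied cleanly:

```python
import re

def sanitise(body: str) -> str:
    """Strip existing <ref ...>...</ref> markup, keeping the cited text."""
    # Drop opening <ref ...> tags, attributes included.
    body = re.sub(r"<ref\b[^>]*>", "", body)
    # Drop the matching closing tags.
    return body.replace("</ref>", "")

xml = '<ref href="#" uk:type="case">[2022] 1 WLR 1585</ref>'
clean = sanitise(xml)
```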
IMPORTANT: the sanitisation step does not currently distinguish between enrichment supplied by the JEP itself, by vCite or from some other source! Particular care should be taken to avoid inadvertently removing vCite enrichment by re-enriching a judgment that includes vCite enrichment when the vCite integration is switched off.
The Case Law Annotator uses a rules-based engine, the Rules Manifest, which is built on top of the spaCy EntityRuler to detect case law citations (e.g. `[2022] 1 WLR 123`). The Rules Manifest is stored as a table in Postgres where each row in the table represents a rule.
The creation of rules is currently managed by modifying and uploading a CSV version of the Rules Manifest, which is stored in `production-tna-s3-tna-sg-rules-bucket` with a filename conforming to the pattern `yyyy_mm_dd_Citation_Manifest.csv`.
See here for guidance on how to create and modify rules in the Rules Manifest.
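The rules-based approach can be pictured with a simplified stand-in. In production the manifest is a Postgres table feeding the spaCy EntityRuler; the regex-backed rule table below is illustrative only, and the rule names and patterns are invented for the example:

```python
import re
from typing import List, Tuple

# Simplified stand-in for the Rules Manifest: each "row" maps a citation
# family to a detection pattern. The real manifest drives the spaCy
# EntityRuler rather than raw regexes.
RULES = {
    "WLR": re.compile(r"\[\d{4}\] \d+ WLR \d+"),
    "EWCA": re.compile(r"\[\d{4}\] EWCA (?:Civ|Crim) \d+"),
}

def detect_citations(text: str) -> List[Tuple[str, str]]:
    """Return (rule_name, matched_citation) pairs found in the text."""
    hits = []
    for name, pattern in RULES.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group()))
    return hits

hits = detect_citations("Compare [2022] 1 WLR 123 with [2021] EWCA Civ 1308.")
```

Because each rule is one row in a table, new citation families can be supported by adding rows rather than changing code, which is the property the CSV-upload workflow above relies on.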
There are two ways to operate the pipeline:
- Triggering the pipeline via file upload to S3
- API integration with the MarkLogic database
The JEP can be operated manually by uploading judgments directly to the JEP's trigger S3 bucket: `s3://production-tna-s3-tna-sg-xml-original-bucket/`. We recommend using the AWS CLI to achieve this, like so:

```shell
aws s3 cp path/to/judgment.xml s3://production-tna-s3-tna-sg-xml-original-bucket/
```
The enrichment process typically takes five to six minutes per judgment. Enriched judgment XML is deposited in the JEP's terminal bucket: `s3://production-tna-s3-tna-sg-xml-third-phase-enriched-bucket`. Again, we recommend using the AWS CLI to retrieve the enriched XML, like so:

```shell
aws s3 cp s3://production-tna-s3-tna-sg-xml-third-phase-enriched-bucket/judgment.xml path/to/local/dir
```
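The upload-then-retrieve flow can also be scripted. The sketch below is a hypothetical helper, not part of the JEP: the bucket names are taken from this documentation, but the function name, polling interval and error handling are assumptions (real code would catch a specific `botocore` 404 error rather than a bare `Exception`):

```python
import time

def enrich_judgment(path: str, key: str, s3=None, timeout: int = 600):
    """Upload a judgment for enrichment and wait for the enriched result.

    Polls the terminal bucket until the enriched XML appears or the
    timeout (default 10 minutes; enrichment takes roughly 5-6) expires.
    """
    if s3 is None:
        import boto3  # deferred so the sketch can be exercised with a stub
        s3 = boto3.client("s3")
    trigger = "production-tna-s3-tna-sg-xml-original-bucket"
    terminal = "production-tna-s3-tna-sg-xml-third-phase-enriched-bucket"
    s3.upload_file(path, trigger, key)
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            s3.head_object(Bucket=terminal, Key=key)
            return s3.get_object(Bucket=terminal, Key=key)["Body"].read()
        except Exception:  # illustrative; narrow this in real code
            time.sleep(30)
    raise TimeoutError(f"{key} was not enriched within {timeout} seconds")
```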
The standard mechanism for triggering the enrichment pipeline is via the TNA editor interface.
The vCite integration is shown more distinctly in the diagram below:
CI/CD works in the following way:
- An engineer branches from `main`, commits code and raises a pull request.
- The code within the repo is checked using the tools defined in `.pre-commit-config.yaml`.
- The Terraform code is checked with a linting tool called TFLint.
- Terraform is validated and planned against staging and production as independent checks.
- Upon merge, non-dockerised lambdas are built, Terraform is planned and applied, and then Docker images are built and pushed to ECR. This occurs for staging first; if staging succeeds, the same happens for production.
- When a pull request is opened, a series of checks are run against both staging and production:
  - Python Black (formats Python code correctly)
  - Python isort (orders imports correctly)
  - TFLint (Terraform linter)
  - Terraform Validate
  - Terraform Init
  - Terraform Plan (a plan of the infrastructure changes for that environment)
- If the checks fail at the pre-commit stage, you can usually fix them by running `pre-commit run --all-files` and committing the changes. Problems which can't be auto-fixed will be explained.
- TFLint will explain any errors it finds.
- Terraform plan needs to be inspected before merging code to ensure the right thing is being applied. Do not assume that a green build is going to build what you want to be built.
- Upon merge, staging environment docker images will be built and pushed to ECR, staging environment Terraform code will be applied. On success of the staging environment, production environment docker images will be built and pushed to ECR, production environment Terraform code will be applied.
As we use AWS Aurora, there is no multi-AZ functionality. Instead, “Aurora automatically replicates storage six ways across three availability zones”.
Each night Amazon takes an automated snapshot of RDS. We also run a manual snapshot of the cluster at midday (UTC) each day, driven by a cron-based Amazon EventBridge rule that triggers a lambda. DB backups are shown in the RDS console under manual snapshots.
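A snapshot lambda of this kind might look like the sketch below. The cluster identifier and snapshot naming scheme are invented for illustration and are not the production configuration; the boto3 call itself (`create_db_cluster_snapshot`) is the standard RDS API for snapshotting an Aurora cluster:

```python
import datetime

def handler(event, context, client=None):
    """EventBridge-triggered Lambda: take a manual snapshot of the cluster."""
    if client is None:
        import boto3  # deferred so the sketch can be exercised with a stub
        client = boto3.client("rds")
    today = datetime.date.today().isoformat()
    snapshot_id = f"jep-manual-{today}"  # hypothetical naming scheme
    client.create_db_cluster_snapshot(
        DBClusterSnapshotIdentifier=snapshot_id,
        DBClusterIdentifier="jep-aurora-cluster",  # hypothetical cluster name
    )
    return snapshot_id
```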
Here are some brief notes on extending the infrastructure.
- The file `terraform/main.tf` invokes each of the modules; more of the same services can be created by adding to those modules. If more modules are created, then `terraform/main.tf` will need to be extended to invoke them.
- Adding an S3 bucket is done by invoking the `secure_bucket` module, located at `terraform/modules/secure_bucket/`. You can see how the existing buckets are created by viewing `terraform/modules/lambda_s3/bucket.tf`; new buckets should be created by adding to this file. If a bucket policy is added, then an extra statement will automatically be added that denies insecure transport.
- Docker images are stored in ECR. Each repo needs to exist before a Docker image can be pushed to it. The repos are created in `terraform/modules/lambda_s3/lambda.tf`.
You can find auto-generated documentation on the Terraform resources in terraform/README.md.
- Run `terraform init --upgrade -backend=false` locally