Merge pull request #641 from TheJacksonLaboratory/input-sanitation
Input sanitation
ielis authored Dec 27, 2023
2 parents b3a1ede + 14173e0 commit f8ee662
Showing 86 changed files with 1,713 additions and 847 deletions.
1 change: 1 addition & 0 deletions docs/index.rst
@@ -18,6 +18,7 @@ as `Human Phenotype Ontology (HPO) <http://www.human-phenotype-ontology.org>`_ t
setup
tutorial
running
input-sanitation
output
explanations
advanced
179 changes: 179 additions & 0 deletions docs/input-sanitation.rst
@@ -0,0 +1,179 @@
.. _rst-input-sanitation:

=========================
Analysis input validation
=========================

LIRICAL performs Q/C checks and sanitation before running the analysis.

Here we summarize the requirements and checks performed on all sections of the analysis input.

Analysis requirements
^^^^^^^^^^^^^^^^^^^^^

Here we summarize the requirements of inputs that LIRICAL needs for the analysis.

Sample identifier
~~~~~~~~~~~~~~~~~

Sample identifier MUST be provided if the analysis is run with a multi-sample VCF file. Otherwise, LIRICAL is unable
to choose the variant genotypes.
The identifier is *optional* if running a phenotype-only analysis or with a single-sample VCF file,
where LIRICAL uses the identifier found in the VCF file.

The analysis will stop if it is run with a multi-sample VCF file and the identifier is not available,
or if the provided identifier is not found in the VCF file (this applies to single-sample VCFs as well).
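
The rules above can be sketched as a small function. This is only an illustration of the decision logic, not LIRICAL code: ``resolve_sample_id`` and its parameters are hypothetical names, and ``vcf_samples`` is assumed to be ``None`` for a phenotype-only analysis.

```python
from typing import Optional, Sequence


def resolve_sample_id(provided: Optional[str],
                      vcf_samples: Optional[Sequence[str]]) -> Optional[str]:
    """Pick the sample identifier (hypothetical helper, not LIRICAL API)."""
    if vcf_samples is None:
        # Phenotype-only analysis: the identifier is optional.
        return provided
    if provided is None:
        if len(vcf_samples) == 1:
            # Single-sample VCF: fall back to the identifier found in the file.
            return vcf_samples[0]
        raise ValueError(
            "Sample identifier must be provided unless the VCF has exactly one sample")
    if provided not in vcf_samples:
        # Applies to single-sample VCFs as well.
        raise ValueError(f"Sample identifier {provided!r} not found in the VCF")
    return provided
```

For example, ``resolve_sample_id(None, ["kid"])`` falls back to ``"kid"``, while ``resolve_sample_id(None, ["kid", "mom"])`` raises an error.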


Phenotypic features
~~~~~~~~~~~~~~~~~~~

LIRICAL uses a set of phenotypic features that were observed or specifically excluded in the subject to prioritize
the diseases. Several checks are applied to mitigate common errors and ensure correctness of the analysis.

The checks focus on the following:

- At least one present or excluded HPO term is provided.
- All phenotypic features are formatted as *Compact Uniform Resource Identifiers* (CURIEs), such as ``HP:0001250``
for *Seizure*. A valid CURIE consists of a prefix (e.g. ``HP``), delimiter (``:`` or ``_``), and id (e.g. ``0001250``).
- The CURIEs are *unique*, i.e. used at most once.
- The CURIEs correspond to identifiers of *current* or *obsolete* HPO terms.
- The HPO terms are descendants of the `Phenotypic abnormality <https://hpo.jax.org/app/browse/term/HP:0000118>`_ term.
- The HPO terms are logically consistent:

- The subject is not annotated with an HPO term in observed and excluded state at the same time.
- The subject is not annotated with an observed HPO term and its observed or excluded ancestor.
- The subject is not annotated with an excluded HPO term and its excluded ancestor.
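
The CURIE format check and the three logical-consistency rules above can be sketched as follows. This is a minimal illustration under stated assumptions: the regular expression (alphabetic prefix, numeric id) is a guess at a reasonable CURIE shape, ``PARENTS`` is a toy child-to-parent map standing in for the HPO hierarchy, and the function names are hypothetical.

```python
import re

# Assumed shape: alphabetic prefix, ':' or '_' delimiter, numeric id.
CURIE = re.compile(r"^(?P<prefix>[A-Za-z]+)[:_](?P<id>\d+)$")

# Toy child -> parent map; the real HPO is a much larger graph, and
# Seizure is treated as a direct parent here only for simplicity.
PARENTS = {
    "HP:0020221": "HP:0001250",  # Clonic seizure -> Seizure
    "HP:0001655": "HP:0001631",  # Patent foramen ovale -> Atrial septal defect
}


def ancestors(term):
    """Return the ancestors of a term in the toy hierarchy, nearest first."""
    found = []
    while term in PARENTS:
        term = PARENTS[term]
        found.append(term)
    return found


def find_inconsistencies(observed, excluded):
    """List violations of the three consistency rules for two sets of CURIEs."""
    issues = []
    for term in sorted(observed & excluded):
        issues.append(f"{term} is both observed and excluded")
    for term in sorted(observed):
        for anc in ancestors(term):
            if anc in observed:
                issues.append(f"{term} is observed while its ancestor {anc} is observed")
            if anc in excluded:
                issues.append(f"{term} is observed while its ancestor {anc} is excluded")
    for term in sorted(excluded):
        for anc in ancestors(term):
            if anc in excluded:
                issues.append(f"{term} is excluded while its ancestor {anc} is excluded")
    return issues
```

Calling ``find_inconsistencies({"HP:0020221"}, {"HP:0001250"})`` reports one issue, because *Clonic seizure* is observed while its ancestor *Seizure* is excluded.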

Age
~~~

LIRICAL does not use the age of the subject at the moment. However, if set, the age must be formatted
as an ISO 8601 duration, for instance ``P1Y8M`` for 1 year and 8 months of age.
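
A rough way to check the age format is a regular expression over the date components of a duration. The sketch below is a simplification and not LIRICAL's parser: it accepts only years, months, weeks and days, ignores time components, and ``parse_age`` is a hypothetical name.

```python
import re
from typing import Optional, Tuple

# Date components only: years, months, weeks, days (a simplified subset).
_AGE = re.compile(r"^P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)W)?(?:(\d+)D)?$")


def parse_age(value: str) -> Optional[Tuple[int, ...]]:
    """Return (years, months, weeks, days), or None if the value is malformed."""
    match = _AGE.match(value)
    if match is None or not any(match.groups()):
        # Reject strings that do not match and the bare "P" with no component.
        return None
    return tuple(int(group) if group else 0 for group in match.groups())
```

With this sketch ``parse_age("P1Y8M")`` yields ``(1, 8, 0, 0)``, and a value such as ``1Y`` is rejected.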

Sex
~~~

The sex must be provided as one of {``MALE``, ``FEMALE``, ``UNKNOWN``}. If the input is not parsable,
``UNKNOWN`` is used by default.
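
The fallback behaviour can be sketched with an enumeration. The names ``Sex`` and ``parse_sex`` are hypothetical, and matching the value exactly (case-sensitive) is an assumption made for this illustration.

```python
from enum import Enum


class Sex(Enum):
    MALE = "MALE"
    FEMALE = "FEMALE"
    UNKNOWN = "UNKNOWN"


def parse_sex(value):
    """Map the input to a Sex member, defaulting to UNKNOWN if not parsable."""
    try:
        return Sex[value]
    except (KeyError, TypeError):
        # Anything that is not exactly MALE, FEMALE or UNKNOWN ends up here.
        return Sex.UNKNOWN
```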

VCF file
~~~~~~~~

The path to the VCF file can be provided via CLI or through a phenopacket/YAML file. The path must point to a file
that is readable by the user running the LIRICAL process.


Validation policy
^^^^^^^^^^^^^^^^^

LIRICAL enforces the requirements depending on the validation policy. There are three validation policies:

- *MINIMAL*
- *LENIENT*
- *STRICT*

with increasing sanity requirements.

The input validation results are always logged in the output. The log includes the following line if the input is OK::

Input sanitation found no issues

Alternatively, the issues are logged to the terminal. For instance::

Found issues 1 errors and 1 warnings
Errors 😱
- Sample must not be annotated with Clonic seizure [HP:0020221] while its ancestor
Seizure [HP:0001250] is excluded. Resolve the logical inconsistency by choosing
one of the terms.
Warnings 😧
- Sample should not be annotated with Patent foramen ovale [HP:0001655] and its ancestor
Atrial septal defect [HP:0001631]. Remove Atrial septal defect [HP:0001631]
from the phenotype terms.

The issues are classified as errors and warnings.
An *error* is a serious issue that MUST be fixed, and human intervention is required.
A *warning* is an issue that SHOULD be fixed. Unlike an error, however, a warning can be fixed automatically.
The output includes a suggested resolution, e.g. choosing *Clonic seizure* or *Seizure* in the error above.

Whether the warnings are fixed depends on the validation policy in use.


`MINIMAL` validation policy
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The minimal validation policy places the fewest constraints on the analysis inputs.
The analysis is run *"as is"* and the run is aborted only if the most important information is missing.
Only the most rudimentary sanitation is applied.

Requirements
############

The analysis requires the following:

- At least one HPO term is provided.
- VCF path points to a readable file, if provided.
- A sample identifier is provided if run with a multi-sample VCF file, and the identifier
is present in the VCF file.

Sanitation
##########

The following actions are performed on the analysis input:

- Malformed CURIEs are removed.
- CURIEs that do not correspond to current or obsolete HPO terms are removed.
- The obsolete HPO term identifiers are replaced with the current identifiers.
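
The three sanitation steps above can be sketched on a list of CURIEs. This is an illustration only: the CURIE pattern and the normalisation of the ``_`` delimiter to ``:`` are assumptions, ``CURRENT_TERMS`` and ``OBSOLETE_TO_CURRENT`` are toy lookup tables, and ``HP:9999999`` is a made-up placeholder for an obsolete identifier.

```python
import re

# Assumed CURIE shape: alphabetic prefix, ':' or '_' delimiter, numeric id.
CURIE = re.compile(r"^([A-Za-z]+)[:_](\d+)$")

# Toy lookup tables standing in for the HPO.
CURRENT_TERMS = {"HP:0001250", "HP:0001631"}
OBSOLETE_TO_CURRENT = {"HP:9999999": "HP:0001250"}  # hypothetical obsolete id


def sanitize_minimal(curies):
    """Drop malformed and unknown CURIEs; replace obsolete ids with current ones."""
    kept = []
    for raw in curies:
        match = CURIE.match(raw)
        if match is None:
            continue  # malformed CURIE
        curie = f"{match.group(1)}:{match.group(2)}"
        curie = OBSOLETE_TO_CURRENT.get(curie, curie)  # obsolete -> current
        if curie in CURRENT_TERMS:
            kept.append(curie)  # anything else is unknown and dropped
    return kept
```

Note that duplicates survive this step; in the description above, their removal belongs to the lenient policy.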


`LENIENT` validation policy
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Lenient validation policy attempts to fix as many issues as possible.

Requirements
############

The policy requires all points of the minimal policy, plus:

- The subject is *NOT* annotated with an HPO term that is both present and excluded.
- The subject is *NOT* annotated with a present HPO term and its excluded ancestor.

Sanitation
##########

The actions of the minimal policy are performed, plus:

- Duplicate HPO terms are removed such that each term is present at most once.
- The HPO terms that are not descendants of Phenotypic abnormality are removed.
- The logical inconsistencies are resolved:

- If the subject is annotated with an excluded HPO term (e.g. no Focal seizure) and its excluded ancestor
(e.g. no Seizure) then the term is removed and the ancestor is kept.
- If the subject is annotated with a present HPO term (e.g. Focal seizure) and its present ancestor (e.g. Seizure),
then the ancestor is removed and the term is kept.
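
The duplicate removal and the two redundancy rules can be sketched as follows. The sketch assumes a toy child-to-parent map (``PARENTS``) in place of the HPO hierarchy, uses hypothetical function names, and leaves out the removal of terms that are not descendants of Phenotypic abnormality.

```python
# Toy child -> parent map; Seizure is treated as a direct parent for simplicity.
PARENTS = {
    "HP:0001655": "HP:0001631",  # Patent foramen ovale -> Atrial septal defect
    "HP:0020221": "HP:0001250",  # Clonic seizure -> Seizure
}


def ancestors(term):
    """Return the set of ancestors of a term in the toy hierarchy."""
    found = set()
    while term in PARENTS:
        term = PARENTS[term]
        found.add(term)
    return found


def sanitize_lenient(observed, excluded):
    """Remove duplicates and redundant terms from observed and excluded lists."""
    observed = list(dict.fromkeys(observed))  # de-duplicate, keep input order
    excluded = list(dict.fromkeys(excluded))
    # Present terms: drop an ancestor if a more specific term is also present.
    redundant = set().union(*(ancestors(term) for term in observed))
    kept_observed = [term for term in observed if term not in redundant]
    # Excluded terms: drop a term if one of its ancestors is also excluded.
    excluded_set = set(excluded)
    kept_excluded = [term for term in excluded if not ancestors(term) & excluded_set]
    return kept_observed, kept_excluded
```

The asymmetry is deliberate: a present term implies its ancestors, so the most specific present term carries all the information, whereas an excluded ancestor already rules out all of its descendants.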

`STRICT` validation policy
~~~~~~~~~~~~~~~~~~~~~~~~~~

The strict validation policy checks for the same issues as the *lenient* policy but fixes none of them:
the analysis is run only if no errors or warnings are found.

Requirements
############

On top of the lenient policy, strict policy requires the following:

- HPO terms are unique.
- HPO terms are descendants of Phenotypic abnormality.
- There are no logical inconsistencies in HPO terms.
- Age is well formatted, if provided.
- Sex is well formatted, if provided.

Sanitation
##########

Strict policy applies no sanitation.
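
One way to read the three policies is as a gate on the validation results. The sketch below is an interpretation, not LIRICAL code: it assumes that any error stops the analysis under every policy and that only the strict policy also treats warnings as blocking. Which issues are reported as errors or warnings still depends on the policy, which the sketch does not model. ``ValidationPolicy`` and ``can_run`` are hypothetical names.

```python
from enum import Enum


class ValidationPolicy(Enum):
    MINIMAL = 1
    LENIENT = 2
    STRICT = 3


def can_run(policy, n_errors, n_warnings):
    """Decide whether the analysis may proceed (assumed gating logic)."""
    if n_errors > 0:
        # Errors need human intervention, so they block under any policy.
        return False
    if policy is ValidationPolicy.STRICT:
        # Strict applies no sanitation, so warnings block as well.
        return n_warnings == 0
    return True
```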


Use the ``--dry-run`` option to check if the inputs can be run under a given validation policy.
22 changes: 13 additions & 9 deletions docs/lirical-tsv.rst
@@ -12,31 +12,35 @@ For example, the following command will run LIRICAL on a Phenopacket and output

By default, LIRICAL outputs the data to a file called ``lirical.tsv``. This can be altered with the ``-x <prefix>`` option.

The TSV output consists of the header and the body. The header includes lines that start with an exclamation mark,
to provide information about the HPO terms used to run the analysis.

The body section summarizes the matches between the patient data and the diseases, one disease per row, ranked by
the post-test probability.
Each row includes the disease credentials, the pre-test and post-test probabilities, and the composite likelihood ratio.
If the analysis was run with a VCF file, the report includes two extra columns with the gene associated with the disease
and the variants found in the gene.
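
Reading such a file amounts to separating the ``!`` header lines from the tab-separated body. The sketch below assumes that the first body line holds the column names. ``read_lirical_tsv`` is a hypothetical helper, and the content of ``SAMPLE`` (the header text and all values) is made up for illustration.

```python
import csv
import io


def read_lirical_tsv(text):
    """Split the text into '!' header lines and a list of row dictionaries."""
    header, body = [], []
    for line in text.splitlines():
        (header if line.startswith("!") else body).append(line)
    # Assumption: the first body line contains the column names.
    rows = list(csv.DictReader(io.StringIO("\n".join(body)), delimiter="\t"))
    return header, rows


# Made-up sample; real output carries HPO term information in the header.
SAMPLE = (
    "! hypothetical header line\n"
    "rank\tdiseaseName\tdiseaseCurie\tpretestprob\tpostestprob\tcompositeLR\n"
    "1\tExample disease\tOMIM:154700\t0.001\t0.95\t4.2\n"
)

header, rows = read_lirical_tsv(SAMPLE)
```

Every value is read as a string, so the probabilities and the likelihood ratio need an explicit conversion (e.g. ``float(rows[0]["compositeLR"])``) before use.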

.. list-table:: LIRICAL's TSV format
:header-rows: 1
:widths: 40 60

* - Item
* - Column name
- Explanation
* - rank
- placement of the candidate diagnosis by LIRICAL
- Placement of the candidate diagnosis by LIRICAL
* - diseaseName
- Name of the candidate disease
* - diseaseCurie
- disease ID, e.g., OMIM:154700
- Disease identifier, e.g., `OMIM:154700`
* - pretestprob
- Pretest probability of the candidate disease
* - postestprob
- Posttest probability of the candidate disease
* - compositeLR
- Combined likelihood ratio of the candidate disease (logarithm of the product of all individual LRs)
* - entrezGeneId
- Identifier of the candidate disease gene (if available)
- Identifier of the candidate disease gene (if run with a VCF file)
* - variants
- variant evaluation (if available)


The file begins with comment lines (that start with an exclamation mark) that provide information about the
HPO terms used to run the analysis.
- Variant evaluation (if run with a VCF file)

6 changes: 5 additions & 1 deletion docs/running.rst
@@ -87,6 +87,10 @@ The configuration options tweak the analysis.
* ``--strict``: use strict penalties if the genotype does not match the disease model
in terms of number of called pathogenic alleles (default: ``false``).
* ``--pathogenicity-threshold``: variants with a greater pathogenicity score are considered deleterious (default: ``0.8``).
* ``--validation-policy``: set the level of input sanity check, see :ref:`rst-input-sanitation` for more info.
Choose from `MINIMAL`, `LENIENT`, `STRICT` (default ``MINIMAL``).
* ``--dry-run``: check if the inputs meet the validation policy requirements, report any issues,
and exit without running the analysis (default: ``false``).

Output options
~~~~~~~~~~~~~~
@@ -124,7 +128,7 @@ The ``prioritize`` command takes the following options:
that correspond to the phenotype terms negated/excluded in the proband.
* ``--assembly`` genome build, choose from `hg19` or `hg38`, must be provided if ``--vcf`` is used (default: ``hg38``).
* ``--vcf``: path to VCF file with exome/genome sequencing results. The file can be compressed.
* ``--sample-id``: proband's identifier (default: `Sample`).
* ``--sample-id``: proband's identifier, must be provided if running with a multi-sample VCF file (default: `subject`).
* ``--age``: proband's age as an ISO8601 duration.
(e.g. ``P9Y`` for 9 years, ``P2Y3M`` for 2 years and 3 months, or ``P33W`` for the 33rd gestational week).
* ``--sex``: proband's sex, choose from `MALE`, `FEMALE`, `UNKNOWN` (default: `UNKNOWN`).
2 changes: 1 addition & 1 deletion docs/setup.rst
@@ -19,7 +19,7 @@ build LIRICAL from source, then the build process described below requires

.. note::
The v1 of LIRICAL was written in Java 8 but starting from v2 we require Java 17 or better to take advantage
of numerous performance improvements and novel language features.
of the novel Java features.

Building from sources
~~~~~~~~~~~~~~~~~~~~~
24 changes: 24 additions & 0 deletions lirical-cli/pom.xml
@@ -45,6 +45,30 @@
<groupId>org.monarchinitiative.svart</groupId>
<artifactId>svart</artifactId>
</dependency>
<dependency>
<groupId>org.phenopackets</groupId>
<artifactId>phenopacket-schema</artifactId>
</dependency>
<dependency>
<groupId>org.phenopackets.phenopackettools</groupId>
<artifactId>phenopacket-tools-core</artifactId>
</dependency>
<dependency>
<groupId>org.phenopackets.phenopackettools</groupId>
<artifactId>phenopacket-tools-io</artifactId>
</dependency>
<dependency>
<groupId>org.phenopackets.phenopackettools</groupId>
<artifactId>phenopacket-tools-util</artifactId>
</dependency>
<dependency>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
</dependency>
<dependency>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java-util</artifactId>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-classic</artifactId>
12 changes: 0 additions & 12 deletions lirical-cli/src/examples/LDS2.v2.json
@@ -33,18 +33,6 @@
"label": "author statement used in manual assertion"
}
}]
}, {
"type": {
"id": "HP:0001631",
"label": "Atrial septal defect"
},
"excluded": false,
"evidence": [{
"evidenceCode": {
"id": "ECO:0000302",
"label": "author statement used in manual assertion"
}
}]
}, {
"type": {
"id": "HP:0000193",