-
Notifications
You must be signed in to change notification settings - Fork 11
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #641 from TheJacksonLaboratory/input-sanitation
Input sanitation
- Loading branch information
Showing
86 changed files
with
1,713 additions
and
847 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,179 @@ | ||
.. _rst-input-sanitation: | ||
|
||
========================= | ||
Analysis input validation | ||
========================= | ||
|
||
LIRICAL performs Q/C checks and sanitation before running the analysis. | ||
|
||
Here we summarize the requirements and checks performed on all sections of the analysis input. | ||
|
||
Analysis requirements | ||
^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
Here we summarize the requirements of inputs that LIRICAL needs for the analysis. | ||
|
||
Sample identifier | ||
~~~~~~~~~~~~~~~~~ | ||
|
||
Sample identifier MUST be provided if the analysis is run with a multi-sample VCF file. Otherwise, LIRICAL is unable | ||
to choose the variant genotypes. | ||
The identifier is *optional* if running a phenotype-only analysis or with a single-sample VCF file, | ||
where LIRICAL uses the identifier found in the VCF file. | ||
|
||
The analysis will stop if run with multi-sample VCF file and the identifier is not available, | ||
or if the provided identifier is not found in the VCF file (applies to single-sample VCFs as well). | ||
|
||
|
||
Phenotypic features | ||
~~~~~~~~~~~~~~~~~~~ | ||
|
||
LIRICAL uses a set of phenotypic features that were observed or specifically excluded in the subject to prioritize | ||
the diseases and several checks are applied to mitigate common errors ensure correctness of the analysis. | ||
|
||
The checks focus on the following: | ||
|
||
- At least one present or excluded HPO term is provided. | ||
- All phenotypic features are formatted as *Compact Uniform Resource Identifiers* (CURIEs), such as ``HP:0001250`` | ||
for *Seizure*. A valid CURIE consists of a prefix (e.g. ``HP``), delimiter (``:`` or ``_``), and id (e.g. ``0001250``). | ||
- The CURIEs are *unique*, i.e. used at most once. | ||
- The CURIEs correspond to identifiers of *current* or *obsolete* HPO terms. | ||
- The HPO terms are descendants of `Phenotypic abnormality <https://hpo.jax.org/app/browse/term/HP:0000118>`_ branch. | ||
- The HPO terms are logically consistent: | ||
|
||
- The subject is not annotated with an HPO term in observed and excluded state at the same time. | ||
- The subject is not annotated with an observed HPO term and its observed or excluded ancestor. | ||
- The subject is not annotated with an excluded HPO term and its excluded ancestor. | ||
|
||
Age | ||
~~~ | ||
|
||
LIRICAL does not use the age of the subject at the moment. However, if set, the age must be formatted | ||
as ISO8601 duration. For instance ``P1Y8M`` for 1 year and 8 months of age. | ||
|
||
Sex | ||
~~~ | ||
|
||
The sex must be provided as one of {``MALE``, ``FEMALE``, ``UNKNOWN``}. If the input is not parsable, | ||
``UNKNOWN`` is used by default. | ||
|
||
VCF file | ||
~~~~~~~~ | ||
|
||
The path to VCF file can be provided via CLI or through phenopacket/YAML file. The path must point to a file | ||
that is readable by the user running the LIRICAL process. | ||
|
||
|
||
Validation policy | ||
^^^^^^^^^^^^^^^^^ | ||
|
||
LIRICAL enforces the requirements depending on the validation policy. There are three validation policies: | ||
|
||
- *MINIMAL* | ||
- *LENIENT* | ||
- *STRICT* | ||
|
||
with increasing sanity requirements. | ||
|
||
The input validation results are always logged in the output. The log includes the following line if the input is OK:: | ||
|
||
Input sanitation found no issues | ||
|
||
Alternatively, the issues are logged to the terminal. For instance:: | ||
|
||
Found issues 0 errors and 1 warnings | ||
Errors 😱 | ||
- Sample must not be annotated with Clonic seizure [HP:0020221] while its ancestor | ||
Seizure [HP:0001250] is excluded. Resolve the logical inconsistency by choosing | ||
one of the terms. | ||
Warnings 😧 | ||
- Sample should not be annotated with Patent foramen ovale [HP:0001655] and its ancestor | ||
Atrial septal defect [HP:0001631]. Remove Atrial septal defect [HP:0001631] | ||
from the phenotype terms. | ||
|
||
The issues are classified as errors and warnings. | ||
*Error* is a serious issue that MUST be fixed and human intervention is required. | ||
Warning is an issue that SHOULD be fixed. However, unlike an error, warning can be fixed automatically. | ||
The output includes a suggested resolution, e.g. choosing *Clonic seizure* or *Seizure* in the error above. | ||
|
||
The warnings are be fixed depending on the used validation policy. | ||
|
||
|
||
`MINIMAL` validation policy | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
Minimal validation policy enforces the least constraint on the analysis inputs. | ||
The analysis is run *"as is"* and the run is aborted only if the most important information is missing. | ||
Only the most rudimentary sanitation is applied. | ||
|
||
Requirements | ||
############ | ||
|
||
The analysis requires the following: | ||
|
||
- At least one HPO term is provided. | ||
- VCF path points to a readable file, if provided. | ||
- Sample identifier is provided if run with a multi-sample VCF and the sample identifier | ||
must be present in the VCF file. | ||
|
||
Sanitation | ||
########## | ||
|
||
The following actions are performed on the analysis input: | ||
|
||
- Malformed CURIEs are removed. | ||
- CURIEs that do not correspond to current or obsolete HPO terms are removed. | ||
- The obsolete HPO term identifiers are replaced with the current identifiers. | ||
|
||
|
||
`LENIENT` validation policy | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
Lenient validation policy attempts to fix as many issues as possible. | ||
|
||
Requirements | ||
############ | ||
|
||
The policy requires all points of the minimal policy, plus: | ||
|
||
- The subject is *NOT* annotated with an HPO term that is both present and excluded. | ||
- The subject is *NOT* annotated with a present HPO term and its excluded ancestor. | ||
|
||
Sanitation | ||
########## | ||
|
||
The actions of the minimal policy are performed, plus: | ||
|
||
- Duplicate HPO terms are removed such that each term is present at most once. | ||
- The HPO terms that are not descendants of Phenotypic abnormality are removed. | ||
- The logical inconsistencies are resolved: | ||
|
||
- If the subject is annotated with an excluded HPO term (e.g. no Focal seizure) and its excluded ancestor | ||
(e.g. no Seizure) then the term is removed and the ancestor is kept. | ||
- If the subject is annotated with a present HPO term (e.g. Focal seizure) and its present ancestor (e.g. Seizure), | ||
then the ancestor is removed and the term is kept. | ||
|
||
`STRICT` validation policy | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
Strict validation policy adds no additional requirements than those of *lenient* policy. However, the analysis | ||
is not run unless no errors or warnings are found. | ||
|
||
Requirements | ||
############ | ||
|
||
On top of the lenient policy, strict policy requires the following: | ||
|
||
- HPO terms are unique. | ||
- HPO terms are descendants of Phenotypic abnormality. | ||
- There are no logical inconsistencies in HPO terms. | ||
- Age is well formatted, if provided. | ||
- Sex is well formatted, if provided. | ||
|
||
Sanitation | ||
########## | ||
|
||
Strict policy applies no sanitation. | ||
|
||
|
||
Use the ``--dry-run`` option to check if the inputs can be run under given validation policy. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.