Merge pull request #550 from moka-guys/feature/wscleaner_incorporation
Feature/wscleaner incorporation (#550)
rebeccahaines1 authored Jul 11, 2024
2 parents 6fecf95 + add9e93 commit 5bb77b2
Showing 36 changed files with 1,553 additions and 457 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/on-pull-request.yaml
@@ -46,4 +46,4 @@ jobs:
- name: Test with pytest
# We do not want it to run the email tests because the credentials are not stored in GitHub
run: |
python3 -m pytest -k 'not email'
python3 -m pytest -k 'not email and not wscleaner'
1 change: 1 addition & 0 deletions .gitignore
@@ -6,3 +6,4 @@ seglh_naming.egg-info/
venv/
temp/
.coverage
*data_unzipped
26 changes: 16 additions & 10 deletions README.md
@@ -7,6 +7,7 @@ This repository contains the main scripts for routine analysis of clinical next
|[demultiplex.py](demultiplex.py) | Command line | Demultiplex (excluding TSO runs) and calculate cluster density for Illumina NGS data using `bcl2fastq2` [(guide)](demultiplex/README.md) |
| [setoff_workflows.py](setoff_workflows.py) | Command line | Upload NGS data to DNAnexus and trigger in-house workflows [(guide)](setoff_workflows/README.md) |
| [upload_runfolder](upload_runfolder) | Command line or module import | Uploads an Illumina runfolder to DNAnexus [(guide)](upload_runfolder/README.md)|
| [wscleaner](wscleaner) | Command line | Automates the deletion of runfolders that have been uploaded to the DNAnexus cloud storage service [(guide)](wscleaner/README.md)|

# Assumptions / Requirements

@@ -16,14 +17,16 @@ Each runfolder must be discrete per workflow, therefore must consist of only one
* SNP
* WES
* Custom Panels / LRPCR
* ONCODEEP
* DEV (with or without UMIs)

The type of run is detected by the scripts by matching the Pan numbers within the sample names in the corresponding samplesheet to the pan numbers in the [panel_config](config/panel_config.py).
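
A minimal sketch of how this matching could work, assuming sample names embed a `Pan` number and that the panel config can be read as a dictionary keyed by pan number. The `PANEL_CONFIG` contents, the `detect_pipelines` helper, the pan numbers, and the sample names below are all hypothetical and do not reflect the real [panel_config](config/panel_config.py):

```python
import re

# Hypothetical stand-in for config/panel_config.py; the real structure may differ
PANEL_CONFIG = {
    "Pan0001": {"pipeline": "wes"},
    "Pan0002": {"pipeline": "snp"},
}

# Assumed pattern for a pan number embedded in a sample name
PAN_NUMBER_REGEX = re.compile(r"Pan\d+")


def detect_pipelines(sample_names):
    """Return the set of pipeline names matched by pan numbers in the sample names."""
    pipelines = set()
    for name in sample_names:
        for pan_number in PAN_NUMBER_REGEX.findall(name):
            if pan_number in PANEL_CONFIG:
                pipelines.add(PANEL_CONFIG[pan_number]["pipeline"])
    return pipelines


# A runfolder is expected to resolve to exactly one pipeline
detected = detect_pipelines(["NGS001_01_SampleA_Pan0001_S1"])
if len(detected) != 1:
    raise ValueError(f"Expected one pipeline per runfolder, found: {detected}")
```

Returning a set makes the "one run type per runfolder" requirement easy to check: more than one element means the samplesheet mixes workflows, and an empty set means no known pan number was found.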

# Setup

The scripts have been tested with Python v3.10.6, so it is recommended that this version of Python is used.

Dependencies, which include the [samplesheet_validator](https://github.com/moka-guys/samplesheet_validator) package**, are installed using the requirements.txt file:
Dependencies, which include the [samplesheet_validator](https://github.com/moka-guys/samplesheet_validator) package\*\*, are installed using the requirements.txt file:

```bash
pip3 install -r requirements.txt
@@ -52,18 +55,18 @@ The below diagram is a UML class diagram showing the relationships between the c
| [demultiplex](demultiplex) | orange | Demultiplex (excluding TSO runs) and calculate cluster density for Illumina NGS data using `bcl2fastq2` [(guide)](demultiplex/README.md) |
| [setoff_workflows](setoff_workflows) | pink | Upload NGS data to DNAnexus and trigger in-house workflows [(guide)](setoff_workflows/README.md) |
| [toolbox](toolbox) | grey | Contains classes and functions shared [(guide)](toolbox/README.md) |
| [upload_runfolder](upload_runfolder) | purple | Uploads an Illumina runfolder to DNAnexus [(guide)](upload_runfolder/README.md) |
| [upload_runfolder](upload_runfolder) | sand | Uploads an Illumina runfolder to DNAnexus [(guide)](upload_runfolder/README.md) |
| [wscleaner](wscleaner) | purple | Automates the deletion of runfolders that have been uploaded to the DNAnexus cloud storage service [(guide)](wscleaner/README.md) |

### Class and Package Diagrams

Class and package diagrams were generated by running the following command from the project root:

```bash
pyreverse -o png -p automate_demultiplex . --ignore=test --source-roots . --colorized --color-palette=#CBC3E3,#99DDFF,#44BB99,#BBCC33,#EEDD88,#EE8866,#FFAABB,#DDDDDD --output-directory img/
pyreverse -o png -p automate_demultiplex . --ignore=test --source-roots . --colorized --color-palette=#CBC3E3,#99DDFF,#44BB99,#BBCC33,#EEDD88,#EE8866,#FFAABB,#DDDDDD,#eab676 --output-directory img/
```



## Package Diagram
![alt text](img/packages_automate_demultiplex.png)

@@ -89,19 +92,21 @@ The above image describes the possible associations in the Class Diagram. In the
| sw (script_logger) | Records script-level logs for the setoff workflows script | `TIMESTAMP_setoff_workflow.log` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/sw_script_logfiles/` |
| sw (rf_loggers["sw"]) | Records runfolder-level logs for the setoff workflows script | `RUNFOLDERNAME_setoff_workflow.log` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/sw_script_logfiles/` |
| dx_run_script | Records the dx run commands for processing the run. N.B. this is not written to by logging | `RUNFOLDERNAME_dx_run_commands.sh` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/dx_run_commands` |
| decision_support_upload_cmds | Records the dx run commands to set off the congenica upload apps. N.B. this is not written to by logging | `RUNFOLDERNAME_decision_support.sh` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/dx_run_commands` |
| post_run_dx_run_script | Records the postprocessing commands (TSO runs only), to be run manually after the pipeline apps complete. N.B. this is not written to by logging | `RUNFOLDERNAME_post_run_commands.sh` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/dx_run_commands` |
| decision_support_upload_cmds_script | Records the dx run commands to set off the congenica upload apps. N.B. this is not written to by logging | `RUNFOLDERNAME_decision_support.sh` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/dx_run_commands` |
| proj_creation_script | Records the commands for creating the DNAnexus project. N.B. this is not written to by logging | `RUNFOLDERNAME_create_nexus_project.sh` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/dx_run_commands` |
| Demultiplex output | Catches any traceback from errors when running the cron job that are not caught by exception handling within the script | `TIMESTAMP.txt` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/Demultiplexing_stdout` |
| demultiplex (script_logger) | Records script-level logs for the demultiplex script | `TIMESTAMP_demultiplex_script.log` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/demultiplexing_script_logfiles/` |
| demultiplex (demux_rf_logger) | Records runfolder-level logs for the demultiplex script | `RUNFOLDERNAME_demultiplex_runfolder.log` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/demultiplexing_script_logfiles/` |
| Bcl2fastq output | STDOUT and STDERR from bcl2fastq2 | `bcl2fastq2_output.log` | Within the runfolder |
| ss_validator | Records runfolder-level logs for the samplesheet_validator script | `RUNFOLDERNAME_samplesheet_validator_script.log` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/samplesheet_validator_script_logfiles/` |
| backup | Records the logs from the upload runfolder script | `RUNFOLDERNAME_upload_runfolder.log` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/upload_runfolder_script_logfiles/` |
| wscleaner | Records the logs from the wscleaner script | `TIMESTAMP_wscleaner.log` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/wscleaner/` |
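
As an illustration of the naming convention in the table above, the sketch below builds a timestamped script-level logfile path. The `script_logfile_path` helper and the timestamp format are assumptions for illustration only; the scripts take their actual values from the config modules.

```python
import datetime
import os

# Log root taken from the table above
LOGDIR = "/usr/local/src/mokaguys/automate_demultiplexing_logfiles"


def script_logfile_path(subdir, suffix, timestamp=None):
    """Build a path of the form LOGDIR/subdir/TIMESTAMP_suffix.log (hypothetical helper)."""
    if timestamp is None:
        # Assumed timestamp format; the real scripts may use a different one
        timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    return os.path.join(LOGDIR, subdir, f"{timestamp}_{suffix}.log")


wscleaner_log = script_logfile_path("wscleaner", "wscleaner", "20240711_120000")
```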


# Pytest

[test](test) contains test data ([/test/data](../test/data)) and test scripts (these use pytest).
[test](test) contains test data ([/test/data](../test/data)), and test scripts within individual modules (these use pytest).

Tests can be executed using the following command. It is important to include the ignore flag, which prevents pytest from scanning through all test files for tests and slowing the run considerably:

@@ -116,11 +121,12 @@ Currently test suite coverage is as follows:
| Module | Coverage |
| ------ | -------- |
| [ad_email.py](ad_email/ad_email.py) | 94 |
| [ad_logger.py](ad_logger/ad_logger.py) | 81 |
| [demultiplex.py](demultiplex/demultiplex.py) | 76 |
| [ad_logger.py](ad_logger/ad_logger.py) | 100 |
| [demultiplex.py](demultiplex/demultiplex.py) | 83 |
| [setoff_workflows.py](setoff_workflows/setoff_workflows.py) | 0 |
| [upload_runfolder.py](upload_runfolder/upload_runfolder.py) | 0 |
| [toolbox.py](toolbox/toolbox.py) | 0 |
| [toolbox.py](toolbox/toolbox.py) | 76 |
| [wscleaner.py](wscleaner/wscleaner.py) | 46 |


**TESTS AND TEST CASES/FILES *MUST* BE MAINTAINED AND UPDATED ACCORDINGLY IN CONJUNCTION WITH SCRIPT DEVELOPMENT**
5 changes: 4 additions & 1 deletion ad_email/ad_email.py
@@ -5,6 +5,7 @@
- AdEmail
Send email to recipient via SMTP
"""

import sys
import os
import jinja2
@@ -111,7 +112,9 @@ def send_email(
self.msg["Subject"] = email_subject
self.msg["From"] = self.sender
self.msg["To"] = recipients
self.msg.attach(MIMEText(email_message, "html", "utf-8")) # Add msg to e-mail body
self.msg.attach(
MIMEText(email_message, "html", "utf-8")
) # Add msg to e-mail body
self.logger.info(self.logger.log_msgs["sending_email"], self.msg)
# Configure SMTP server connection for sending email
with smtplib.SMTP(
12 changes: 9 additions & 3 deletions test/test_ad_email.py → ad_email/test_ad_email.py
@@ -4,16 +4,22 @@
workstation where the required auth details are stored
"""

import os
import pytest
from .conftest import logger_obj
from ad_email.ad_email import AdEmail
from config.ad_config import AdEmailConfig

logger_obj = logger_obj
from ..conftest import test_data_temp
from ad_logger import ad_logger

# TODO finish this test suite as it is currently incomplete


@pytest.fixture(scope="function")
def logger_obj():
    temp_log = os.path.join(test_data_temp, "temp.log")
    return ad_logger.AdLogger(__name__, "demux", temp_log).get_logger()


class TestAdEmail:
"""
Test Email class
10 changes: 6 additions & 4 deletions ad_logger/ad_logger.py
@@ -1,6 +1,7 @@
"""
Automate demultiplex logging. Classes required for logging
"""

import sys
import re
import logging
@@ -31,14 +32,14 @@ def get_logging_formatter() -> str:
)


def set_root_logger() -> None:
def set_root_logger() -> logging.Logger:
"""
Set up root logger and add stream handler and syslog handler - we only want to add these once
else it will duplicate log messages to the terminal. All loggers named with the same stem
as the root logger will use these same syslog handler and stream handler
:return None:
:return logger: Logging object
"""
sensitive_formatter=SensitiveFormatter(get_logging_formatter())
sensitive_formatter = SensitiveFormatter(get_logging_formatter())
logger = logging.getLogger(AdLoggerConfig.REPO_NAME)
stream_handler = logging.StreamHandler(sys.stdout)
stream_handler.setFormatter(sensitive_formatter)
@@ -53,8 +54,9 @@ def set_root_logger() -> None:
handlers=[
stream_handler,
syslog_handler,
]
],
)
return logger


def shutdown_logs(logger: logging.Logger) -> None:
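
The docstring above notes that the stream and syslog handlers should only be added once, otherwise messages are duplicated on the terminal. A minimal standalone sketch of that idea follows. It uses a guard on `logger.handlers` rather than the approach in this commit, and the logger name `example_repo`, the helper name `get_root_logger`, and the format string are hypothetical:

```python
import logging
import sys


def get_root_logger(name="example_repo"):
    """Return the named logger, attaching a stdout handler only on the first call."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # Guard so repeated calls do not stack handlers
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.DEBUG)
    return logger


first = get_root_logger()
second = get_root_logger()  # Same logger object, still a single handler

# Loggers named with the same stem propagate records to the parent's handlers
child = logging.getLogger("example_repo.demux")
child.info("Test log message")
```

Because `logging.getLogger` returns the same object for the same name, returning the logger from the setup function, as this commit now does, lets callers and tests inspect the configured handlers directly.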
3 changes: 1 addition & 2 deletions test/test_ad_logger.py → ad_logger/test_ad_logger.py
@@ -1,5 +1,6 @@
""" ad_logger.py pytest unit tests. The test suite is currently incomplete
"""

import pytest
from toolbox import toolbox
from ad_logger import ad_logger
@@ -47,5 +48,3 @@ def test_get_loggers(self, logfiles_config, caplog):
f"Test log message. Logger {loggers[logger_name].name}"
)
assert loggers[logger_name].name in caplog.text


30 changes: 24 additions & 6 deletions config/ad_config.py
@@ -9,6 +9,7 @@
- ToolboxConfig
- URConfig
"""

import os
import sys
import datetime
@@ -86,7 +87,9 @@
# DNAnexus upload agent path
UPLOAD_AGENT_EXE = f"{DOCUMENT_ROOT}/apps/dnanexus-upload-agent-1.5.17-linux/ua"
BCL2FASTQ_DOCKER = "seglh/bcl2fastq2:v2.20.0.422_60dbb5a"
GATK_DOCKER = "broadinstitute/gatk:4.1.8.1" # TODO this image should have a hash added in future
GATK_DOCKER = (
"broadinstitute/gatk:4.1.8.1" # TODO this image should have a hash added in future
)

LANE_METRICS_SUFFIX = ".illumina_lane_metrics"
DEMUX_NOT_REQUIRED_MSG = "%s run. Does not need demultiplexing locally"
@@ -123,7 +126,7 @@
"ed_readcount": f"{TOOLS_PROJECT}:applet-GbkVzbQ0jy1zBZf5k6Xk6QP7", # ED_readcount_analysis_v1.3.0
"ed_cnvcalling": f"{TOOLS_PROJECT}:applet-GbkVyQ80jy1Xf1p6jpPK6p1x", # ED_cnv_calling_v1.3.0
"rpkm": f"{TOOLS_PROJECT}:applet-FxJj0F00jy1ZVXp36PBz2p1j", # RPKM_using_conifer_v1.6
"duty_csv": f"{TOOLS_PROJECT}:applet-GkzJfX80jy1fQvPk1z316gBy", # duty_csv_v1.4.0
"duty_csv": f"{TOOLS_PROJECT}:applet-Gp75GB00360KXPV4Jy7PPFfQ", # duty_csv_v1.5.0
},
"WORKFLOWS": {
"pipe": f"{TOOLS_PROJECT}:workflow-GPq04280jy1k1yVkQP0fXqBg", # GATK3.5_v2.18
@@ -262,7 +265,10 @@

DX_CMDS = {
"create_proj": 'PROJECT_ID="$(dx new project --bill-to %s "%s" --brief --auth ${AUTH})"',
"find_proj_name": f"{SDK_SOURCE}; dx find projects --name *%s* " "--auth %s | awk '{print $3}'",
"find_proj_name": (
f"{SDK_SOURCE}; dx find projects --name *%s* "
"--auth %s | awk '{print $3}'"
),
"proj_name_from_id": f"{SDK_SOURCE}; dx describe %s --auth %s --json | jq -r .name",
"find_proj_id": f"{SDK_SOURCE}; dx describe %s --auth %s --json | jq -r .id",
"find_execution_id": (
@@ -366,7 +372,7 @@ class DemultiplexConfig(PanelConfig):
"checksums_do_not_match": "Checksums do not match", # Failure message written to md5sum file by integrity check scripts
"samplesheet_success": "Samplesheet check successful with no errors identified: %s",
"samplesheet_fail": "Processing halted. SampleSheet contains SampleSheet errors: %s ",
"upload_flag_umis": "Runfolder contains UMIs. Runfolder will not be uploaded and requires manual upload: %s"
"upload_flag_umis": "Runfolder contains UMIs. Runfolder will not be uploaded and requires manual upload: %s",
}
TESTING = TESTING
BCL2FASTQ2_CMD = (
@@ -417,7 +423,7 @@ class SWConfig(PanelConfig):
RUNFOLDERS = RUNFOLDERS
PROD_ORGANISATION = "org-viapath_prod" # Prod org for billing
if BRANCH == "main": # Prod branch

BSPS_ID = "BSPS_MD"
DNANEXUS_USERS = { # User access level
# TODO remove InterpretationRequest user once per-user accounts set up
@@ -538,7 +544,8 @@ class ToolboxConfig(PanelConfig):
"""
Toolbox configuration
"""
if BRANCH == "master":

if BRANCH == "main":
DNANEXUS_PROJECT_PREFIX = "002_" # Denotes production status of run
else:
DNANEXUS_PROJECT_PREFIX = "003_" # Denotes development status of run
@@ -596,3 +603,14 @@ class URConfig:
STRINGS = {
"upload_started": "Upload started", # Statement to write to DNAnexus upload started file
}


class RunfolderCleanupConfig(PanelConfig):
    """
    Runfolder Cleanup configuration
    """

    TIMESTAMP = TIMESTAMP
    RUNFOLDER_PATTERN = RUNFOLDER_PATTERN
    RUNFOLDERS = RUNFOLDERS
    CREDENTIALS = CREDENTIALS
5 changes: 3 additions & 2 deletions config/log_msgs_config.py
@@ -37,6 +37,7 @@
"fastq_nonexistent": "No fastq could be identified that matches the following strings: %s. Error: %s",
"sample_excluded": "Sample excluded from samples dictionary due to missing fastqs: %s",
"control_sample": "%s control sample detected: %s",
"missing_panno": "Could not identify pan number from the sample name in the sample sheet: %s",
"multiple_pipeline_names": (
"Multiple pipeline names detected from panel config for sample list: %s. Scripts do not support different "
"pipelines for the same run. Supported pipelines: %s"
@@ -45,6 +46,8 @@
"fastq_valid": "Gzip --test determined that the fastq is valid: %s",
"fastq_invalid": "Gzip --test determined that the fastq is not valid: %s. Stdout: %s. Stderr: %s",
"demux_success": "Demultiplexing was successful for the run with all fastqs valid",
"wes_batch_nos_identified": "WES batch numbers %s identified",
"wes_batch_nos_missing": "WES batch numbers missing. Check for errors in the sample names. Script exited",
},
"ad_email": {
"sending_email": "Sending the email message: %s",
@@ -146,8 +149,6 @@
"upload_rf_error": (
"An error occurred when uploading the rest of the runfolder: %s. See %s and %s for further details. Script exited"
),
"wes_batch_nos_identified": "WES batch numbers %s identified",
"wes_batch_nos_missing": "WES batch numbers missing. Check for errors in the sample names. Script exited",
"library_no_err": "Unable to identify library numbers. Script exited. Check for underscores in the sample names.",
"checking_fastq": "Checking fastq has been collected: %s",
"sample_match": "Fastq in the BaseCalls directory matches the sample name in the SampleSheet: %s, %s",
1 change: 1 addition & 0 deletions config/panel_config.py
@@ -47,6 +47,7 @@
dry_lab True if required to share with dry lab, None if not
umis True if run has UMIs
"""

# TODO in future do we want to swap physical paths for file IDs

TOOLS_PROJECT = "project-ByfFPz00jy1fk6PjpZ95F27J" # 001_ToolsReferenceData