Merge pull request #550 from moka-guys/feature/wscleaner_incorporation

Feature/wscleaner incorporation (#550)
moka-guys · Jul 11, 2024 · 5bb77b2
2 parents 6fecf95 + add9e93
commit 5bb77b2

Showing 36 changed files with 1,553 additions and 457 deletions.
diff --git a/.github/workflows/on-pull-request.yaml b/.github/workflows/on-pull-request.yaml
@@ -46,4 +46,4 @@ jobs:
       - name: Test with pytest
       # We do not want it to run the email tests because the credentials are not stored in GitHub
         run: |
-          python3 -m pytest -k 'not email'
+          python3 -m pytest -k 'not email and not wscleaner'
diff --git a/.gitignore b/.gitignore
@@ -6,3 +6,4 @@ seglh_naming.egg-info/
 venv/
 temp/
 .coverage
+*data_unzipped
diff --git a/README.md b/README.md
@@ -7,6 +7,7 @@ This repository contains the main scripts for routine analysis of clinical next
 |[demultiplex.py](demultiplex.py) | Command line | Demultiplex (excluding TSO runs) and calculate cluster density for Illumina NGS data using `bcl2fastq2` [(guide)](demultiplex/README.md) |
 | [setoff_workflows.py](setoff_workflows.py) | Command line | Upload NGS data to DNAnexus and trigger in-house workflows [(guide)](setoff_workflows/README.md) |
 | [upload_runfolder](upload_runfolder) | Command line or module import | Uploads an Illumina runfolder to DNAnexus [(guide)](upload_runfolder/README.md)|
+| [wscleaner](wscleaner) | Command line | Automates the deletion of runfolders that have been uploaded to the DNAnexus cloud storage service [(guide)](wscleaner/README.md)|
 
 # Assumptions / Requirements
 
@@ -16,14 +17,16 @@ Each runfolder must be discrete per workflow, therefore must consist of only one
 * SNP
 * WES
 * Custom Panels / LRPCR
+* ONCODEEP
+* DEV (with or without UMIs)
 
 The type of run is detected by the scripts by matching the Pan numbers within the sample names in the corresponding samplesheet to the pan numbers in the [panel_config](config/panel_config.py).
 
 # Setup
 
 The script has been tested using python v3.10.6 therefore it is recommended that this version of python is used.
 
-Dependencies, which include the [samplesheet_validator](https://github.com/moka-guys/samplesheet_validator) package**, are installed using the requirements.txt file:
+Dependencies, which include the [samplesheet_validator](https://github.com/moka-guys/samplesheet_validator) package\*\*, are installed using the requirements.txt file:
 
 ```bash
 pip3 install -r requirements.txt
@@ -52,18 +55,18 @@ The below diagram is a UML class diagram showing the relationships between the c
 | [demultiplex](demultiplex) | orange | Demultiplex (excluding TSO runs) and calculate cluster density for Illumina NGS data using `bcl2fastq2` [(guide)](demultiplex/README.md) |
 | [setoff_workflows](setoff_workflows) | pink | Upload NGS data to DNAnexus and trigger in-house workflows [(guide)](setoff_workflows/README.md) |
 | [toolbox](toolbox) | grey | Contains classes and functions shared [(guide)](toolbox/README.md) |
-| [upload_runfolder](upload_runfolder) | purple | Uploads an Illumina runfolder to DNAnexus [(guide)](upload_runfolder/README.md) |
+| [upload_runfolder](upload_runfolder) | sand | Uploads an Illumina runfolder to DNAnexus [(guide)](upload_runfolder/README.md) |
+| [wscleaner](wscleaner) | purple | Automates the deletion of runfolders that have been uploaded
+to the DNAnexus cloud storage service | [(guide)](wscleaner/README.md) |
 
 ### Class and Package Diagrams
 
 Class and package diagrams were generated by running the following command from the project root:
 
 ```bash
-pyreverse -o png -p automate_demultiplex . --ignore=test --source-roots . --colorized --color-palette=#CBC3E3,#99DDFF,#44BB99,#BBCC33,#EEDD88,#EE8866,#FFAABB,#DDDDDD --output-directory img/
+pyreverse -o png -p automate_demultiplex . --ignore=test --source-roots . --colorized --color-palette=#CBC3E3,#99DDFF,#44BB99,#BBCC33,#EEDD88,#EE8866,#FFAABB,#DDDDDD,#eab676 --output-directory img/
 ```
 
-
-
 ## Package Diagram
 ![alt text](img/packages_automate_demultiplex.png)
 
@@ -89,19 +92,21 @@ The above image describes the possible associations in the Class Diagram. In the
 | sw (script_logger) | Records script-level logs for the setoff workflows script | `TIMESTAMP_setoff_workflow.log` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/sw_script_logfiles/` |
 | sw (rf_loggers["sw"]) | Records runfolder-level logs for the setoff workflows script | `RUNFOLDERNAME_setoff_workflow.log` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/sw_script_logfiles/` |
 | dx_run_script | Records the dx run commands for processing the run. N.B. this is not written to by logging | `RUNFOLDERNAME_dx_run_commands.sh` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/dx_run_commands` |
-| decision_support_upload_cmds | Records the dx run commands to set off the congenica upload apps. N.B. this is not written to by logging | `RUNFOLDERNAME_decision_support.sh` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/dx_run_commands` |
+| post_run_dx_run_script | Records the postprocessing commands (TSO runs only), to be run manually after the pipeline apps complete. N.B. this is not written to by logging | `RUNFOLDERNAME_post_run_commands.sh` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/dx_run_commands` |
+| decision_support_upload_cmds_script | Records the dx run commands to set off the congenica upload apps. N.B. this is not written to by logging | `RUNFOLDERNAME_decision_support.sh` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/dx_run_commands` |
 | proj_creation_script | Records the commands for creating the DNAnexus project. N.B. this is not written to by logging | `RUNFOLDERNAME_create_nexus_project.sh` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/dx_run_commands` |
 | Demultiplex output | Catches any traceback from errors when running the cron job that are not caught by exception handling within the script | `TIMESTAMP.txt` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/Demultiplexing_stdout` |
 | demultiplex (script_logger) | Records script-level logs for the demultiplex script | `TIMESTAMP_demultiplex_script.log` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/demultiplexing_script_logfiles/` |
 | demultiplex (demux_rf_logger) | Records runfolder-level logs for the demultiplex script | `RUNFOLDERNAME_demultiplex_runfolder.log` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/demultiplexing_script_logfiles/` |
  Bcl2fastq output | STDOUT and STDERR from bcl2fastq2 | `bcl2fastq2_output.log` | Within the runfolder |
 | ss_validator | Records runfolder-level logs for the samplesheet_validator script | `RUNFOLDERNAME_samplesheet_validator_script.log` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/samplesheet_validator_script_logfiles/` |
 | backup | Records the logs from the upload runfolder script | `RUNFOLDERNAME_upload_runfolder.log` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/upload_runfolder_script_logfiles/` |
+| wscleaner | Records the logs from the wscleaner script | `TIMESTAMP_wscleaner.log` | `/usr/local/src/mokaguys/automate_demultiplexing_logfiles/wscleaner/` |
 
 
 # Pytest
 
-[test](test) contains test data ([/test/data](../test/data)) and test scripts (these use pytest).
+[test](test) contains test data ([/test/data](../test/data)), and test scripts within individual modules (these use pytest).
 
 Tests can be executed using the following command. It is important to include the ignore flag to prevent pytest from scanning for tests through all test files, which slows down the tests considerably
 
@@ -116,11 +121,12 @@ Currently test suite coverage is as follows:
 | Module | Coverage |
 | ------ | -------- |
 | [ad_email.py](ad_email/ad_email.py) | 94 |
-| [ad_logger.py](ad_logger/ad_logger.py) | 81 |
-| [demultiplex.py](demultiplex/demultiplex.py) | 76 |
+| [ad_logger.py](ad_logger/ad_logger.py) | 100 |
+| [demultiplex.py](demultiplex/demultiplex.py) | 83 |
 | [setoff_workflows.py](setoff_workflows/setoff_workflows.py) | 0 |
 | [upload_runfolder.py](upload_runfolder/upload_runfolder.py) | 0 |
-| [toolbox.py](toolbox/toolbox.py) | 0 |
+| [toolbox.py](toolbox/toolbox.py) | 76 |
+| [wscleaner.py](wscleaner/wscleaner.py) | 46 |
 
 
 **TESTS AND TEST CASES/FILES *MUST* BE MAINTAINED AND UPDATED ACCORDINGLY IN CONJUNCTION WITH SCRIPT DEVELOPMENT**
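
The README text in this diff says the run type is detected by matching Pan numbers in the sample names against those in `panel_config`. A minimal sketch of that idea follows. The `PANEL_CONFIG` dictionary, its pan numbers, the `Pan\d+` pattern and both function names are hypothetical stand-ins, not the repository's actual config or code.

```python
import re

# Hypothetical stand-in for config/panel_config.py (real keys and fields differ)
PANEL_CONFIG = {
    "Pan0001": {"pipeline": "wes"},
    "Pan0002": {"pipeline": "snp"},
}


def get_pan_number(sample_name):
    """Return the first Pan number found in a sample name, or None."""
    match = re.search(r"Pan\d+", sample_name)
    return match.group(0) if match else None


def detect_pipeline(sample_names):
    """Map every sample to a pipeline and require exactly one pipeline per run."""
    pipelines = set()
    for name in sample_names:
        pan = get_pan_number(name)
        if pan is None or pan not in PANEL_CONFIG:
            raise ValueError(f"Could not identify pan number from sample name: {name}")
        pipelines.add(PANEL_CONFIG[pan]["pipeline"])
    if len(pipelines) != 1:
        raise ValueError(f"Multiple pipeline names detected: {sorted(pipelines)}")
    return pipelines.pop()


print(detect_pipeline(["NGS1_01_Sample_Pan0001", "NGS1_02_Sample_Pan0001"]))
```

Raising on a missing Pan number or on mixed pipelines mirrors the intent of the `missing_panno` and `multiple_pipeline_names` log messages in this diff, although the real scripts may log and continue rather than raise.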

diff --git a/ad_email/ad_email.py b/ad_email/ad_email.py
@@ -5,6 +5,7 @@
 - AdEmail
     Send email to recipient via SMTP
 """
+
 import sys
 import os
 import jinja2
@@ -111,7 +112,9 @@ def send_email(
             self.msg["Subject"] = email_subject
             self.msg["From"] = self.sender
             self.msg["To"] = recipients
-            self.msg.attach(MIMEText(email_message, "html", "utf-8"))  # Add msg to e-mail body
+            self.msg.attach(
+                MIMEText(email_message, "html", "utf-8")
+            )  # Add msg to e-mail body
             self.logger.info(self.logger.log_msgs["sending_email"], self.msg)
             # Configure SMTP server connection for sending email
             with smtplib.SMTP(

diff --git a/test/test_ad_email.py b/ad_email/test_ad_email.py
@@ -4,16 +4,22 @@
 workstation where the required auth details are stored
 """
 
+import os
 import pytest
-from .conftest import logger_obj
 from ad_email.ad_email import AdEmail
 from config.ad_config import AdEmailConfig
-
-logger_obj = logger_obj
+from ..conftest import test_data_temp
+from ad_logger import ad_logger
 
 # TODO finish this test suite as it is currently incomplete
 
 
+@pytest.fixture(scope="function")
+def logger_obj():
+    temp_log = os.path.join(test_data_temp, "temp.log")
+    return ad_logger.AdLogger(__name__, "demux", temp_log).get_logger()
+
+
 class TestAdEmail:
     """
     Test Email class

diff --git a/ad_logger/ad_logger.py b/ad_logger/ad_logger.py
@@ -1,6 +1,7 @@
 """
 Automate demultiplex logging. Classes required for logging
 """
+
 import sys
 import re
 import logging
@@ -31,14 +32,14 @@ def get_logging_formatter() -> str:
     )
 
 
-def set_root_logger() -> None:
+def set_root_logger() -> object:
     """
     Set up root logger and add stream handler and syslog handler - we only want to add these once
     else it will duplicate log messages to the terminal. All loggers named with the same stem
     as the root logger will use these same syslog handler and stream handler
-        :return None:
+        :return logger: Logging object
     """
-    sensitive_formatter=SensitiveFormatter(get_logging_formatter())
+    sensitive_formatter = SensitiveFormatter(get_logging_formatter())
     logger = logging.getLogger(AdLoggerConfig.REPO_NAME)
     stream_handler = logging.StreamHandler(sys.stdout)
     stream_handler.setFormatter(sensitive_formatter)
@@ -53,8 +54,9 @@ def set_root_logger() -> None:
         handlers=[
             stream_handler,
             syslog_handler,
-        ]
+        ],
     )
+    return logger
 
 
 def shutdown_logs(logger: logging.Logger) -> None:
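
The change above makes `set_root_logger()` return the logger it configures, and its docstring notes that handlers should be added only once so that messages are not duplicated. The sketch below illustrates that pattern with the standard `logging` module only. It is an assumption-laden illustration and not the repository's implementation: the `name` and `stream` parameters, the logger name `demo_repo` and the format string are all hypothetical.

```python
import io
import logging


def set_root_logger(name, stream):
    """Attach a single stream handler to the named logger and return the logger."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # Guard so repeated calls do not duplicate output
        handler = logging.StreamHandler(stream)
        handler.setFormatter(
            logging.Formatter("%(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


buffer = io.StringIO()
root = set_root_logger("demo_repo", buffer)
set_root_logger("demo_repo", buffer)  # Second call adds no further handler

# A logger sharing the name stem propagates its records to the parent's handler
child = logging.getLogger("demo_repo.demux")
child.info("hello")
output = buffer.getvalue()
print(output, end="")
```

Returning the logger lets callers and tests hold a reference to the configured object instead of looking it up again by name.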

diff --git a/test/test_ad_logger.py b/ad_logger/test_ad_logger.py
@@ -1,5 +1,6 @@
 """ ad_logger.py pytest unit tests. The test suite is currently incomplete
 """
+
 import pytest
 from toolbox import toolbox
 from ad_logger import ad_logger
@@ -47,5 +48,3 @@ def test_get_loggers(self, logfiles_config, caplog):
                 f"Test log message. Logger {loggers[logger_name].name}"
             )
             assert loggers[logger_name].name in caplog.text
-
-
diff --git a/config/ad_config.py b/config/ad_config.py
@@ -9,6 +9,7 @@
 - ToolboxConfig
 - URConfig
 """
+
 import os
 import sys
 import datetime
@@ -86,7 +87,9 @@
 # DNAnexus upload agent path
 UPLOAD_AGENT_EXE = f"{DOCUMENT_ROOT}/apps/dnanexus-upload-agent-1.5.17-linux/ua"
 BCL2FASTQ_DOCKER = "seglh/bcl2fastq2:v2.20.0.422_60dbb5a"
-GATK_DOCKER = "broadinstitute/gatk:4.1.8.1"  # TODO this image should have a hash added in future
+GATK_DOCKER = (
+    "broadinstitute/gatk:4.1.8.1"  # TODO this image should have a hash added in future
+)
 
 LANE_METRICS_SUFFIX = ".illumina_lane_metrics"
 DEMUX_NOT_REQUIRED_MSG = "%s run. Does not need demultiplexing locally"
@@ -123,7 +126,7 @@
         "ed_readcount": f"{TOOLS_PROJECT}:applet-GbkVzbQ0jy1zBZf5k6Xk6QP7",  # ED_readcount_analysis_v1.3.0
         "ed_cnvcalling": f"{TOOLS_PROJECT}:applet-GbkVyQ80jy1Xf1p6jpPK6p1x",  # ED_cnv_calling_v1.3.0
         "rpkm": f"{TOOLS_PROJECT}:applet-FxJj0F00jy1ZVXp36PBz2p1j",  # RPKM_using_conifer_v1.6
-        "duty_csv": f"{TOOLS_PROJECT}:applet-GkzJfX80jy1fQvPk1z316gBy",  # duty_csv_v1.4.0
+        "duty_csv": f"{TOOLS_PROJECT}:applet-Gp75GB00360KXPV4Jy7PPFfQ",  # duty_csv_v1.5.0
     },
     "WORKFLOWS": {
         "pipe": f"{TOOLS_PROJECT}:workflow-GPq04280jy1k1yVkQP0fXqBg",  # GATK3.5_v2.18
@@ -262,7 +265,10 @@
 
 DX_CMDS = {
     "create_proj": 'PROJECT_ID="$(dx new project --bill-to %s "%s" --brief --auth ${AUTH})"',
-    "find_proj_name": f"{SDK_SOURCE}; dx find projects --name *%s* " "--auth %s | awk '{print $3}'",
+    "find_proj_name": (
+        f"{SDK_SOURCE}; dx find projects --name *%s* "
+        "--auth %s | awk '{print $3}'"
+    ),
     "proj_name_from_id": f"{SDK_SOURCE}; dx describe %s --auth %s --json | jq -r .name",
     "find_proj_id": f"{SDK_SOURCE}; dx describe %s --auth %s --json | jq -r .id",
     "find_execution_id": (
@@ -366,7 +372,7 @@ class DemultiplexConfig(PanelConfig):
         "checksums_do_not_match": "Checksums do not match",  # Failure message written to md5sum file by integrity check scripts
         "samplesheet_success": "Samplesheet check successful with no errors identified: %s",
         "samplesheet_fail": "Processing halted. SampleSheet contains SampleSheet errors: %s ",
-        "upload_flag_umis": "Runfolder contains UMIs. Runfolder will not be uploaded and requires manual upload: %s"
+        "upload_flag_umis": "Runfolder contains UMIs. Runfolder will not be uploaded and requires manual upload: %s",
     }
     TESTING = TESTING
     BCL2FASTQ2_CMD = (
@@ -417,7 +423,7 @@ class SWConfig(PanelConfig):
     RUNFOLDERS = RUNFOLDERS
     PROD_ORGANISATION = "org-viapath_prod"  # Prod org for billing
     if BRANCH == "main":  # Prod branch
-        
+
         BSPS_ID = "BSPS_MD"
         DNANEXUS_USERS = {  # User access level
             # TODO remove InterpretationRequest user once per-user accounts set up
@@ -538,7 +544,8 @@ class ToolboxConfig(PanelConfig):
     """
     Toolbox configuration
     """
-    if BRANCH == "master":
+
+    if BRANCH == "main":
         DNANEXUS_PROJECT_PREFIX = "002_"  # Denotes production status of run
     else:
         DNANEXUS_PROJECT_PREFIX = "003_"  # Denotes development status of run
@@ -596,3 +603,14 @@ class URConfig:
     STRINGS = {
         "upload_started": "Upload started",  # Statement to write to DNAnexus upload started file
     }
+
+
+class RunfolderCleanupConfig(PanelConfig):
+    """
+    Runfolder Cleanup configuration
+    """
+
+    TIMESTAMP = TIMESTAMP
+    RUNFOLDER_PATTERN = RUNFOLDER_PATTERN
+    RUNFOLDERS = RUNFOLDERS
+    CREDENTIALS = CREDENTIALS
diff --git a/config/log_msgs_config.py b/config/log_msgs_config.py
@@ -37,6 +37,7 @@
         "fastq_nonexistent": "No fastq could be intentified that matches the following strings: %s. Error: %s",
         "sample_excluded": "Sample excluded from samples dictionary due to missing fastqs: %s",
         "control_sample": "%s control sample detected: %s",
+        "missing_panno": "Could not identify pan number from the sample name in the sample sheet: %s",
         "multiple_pipeline_names": (
             "Multiple pipeline names detected from panel config for sample list: %s. Scripts do not support different "
             "pipelines for the same run. Supported pipelines: %s"
@@ -45,6 +46,8 @@
         "fastq_valid": "Gzip --test determined that the fastq is valid: %s",
         "fastq_invalid": "Gzip --test determined that the fastq is not valid: %s. Stdout: %s. Stderr: %s",
         "demux_success": "Demultiplexing was successful for the run with all fastqs valid",
+        "wes_batch_nos_identified": "WES batch numbers %s identified",
+        "wes_batch_nos_missing": "WES batch numbers missing. Check for errors in the sample names. Script exited",
     },
     "ad_email": {
         "sending_email": "Sending the email message: %s",
@@ -146,8 +149,6 @@
         "upload_rf_error": (
             "An error occurred when uploading the rest of the runfolder: %s. See %s and %s for further details. Script exited"
         ),
-        "wes_batch_nos_identified": "WES batch numbers %s identified",
-        "wes_batch_nos_missing": "WES batch numbers missing. Check for errors in the sample names. Script exited",
         "library_no_err": "Unable to identify library numbers. Script exited. Check for underscores in the sample names.",
         "checking_fastq": "Checking fastq has been collected: %s",
         "sample_match": "Fastq in the BaseCalls directory matches the sample name in the SampleSheet: %s, %s",

diff --git a/config/panel_config.py b/config/panel_config.py
@@ -47,6 +47,7 @@
     dry_lab                         True if required to share with dry lab, None if not
     umis                            True if run has UMIs
 """
+
 # TODO in future do we want to swap physical paths for file IDs
 
 TOOLS_PROJECT = "project-ByfFPz00jy1fk6PjpZ95F27J"  # 001_ToolsReferenceData