0.1.0; fixes issues 10,19,20,22,23,25

bede · Jul 23, 2023 · 50b9f53 · 50b9f53
1 parent 8116618
commit 50b9f53
Show file tree

Hide file tree

Showing 28 changed files with 365 additions and 234 deletions.
diff --git a/README.md b/README.md
@@ -1,20 +1,18 @@
-[![Tests](https://github.com/bede/hostile/actions/workflows/test.yml/badge.svg)](https://github.com/bede/hostile/actions/workflows/test.yml) [![PyPI version](https://img.shields.io/pypi/v/hostile)](https://pypi.org/project/hostile/) [![Bioconda version](https://anaconda.org/bioconda/hostile/badges/version.svg)](https://anaconda.org/bioconda/hostile/) [![Install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square&logo=anaconda)](https://biocontainers.pro/tools/hostile) [![Install with Docker](https://img.shields.io/badge/install%20with-docker-important.svg?style=flat-square&logo=docker)](https://biocontainers.pro/tools/hostile) [![DOI:10.1101/2023.07.04.547735](http://img.shields.io/badge/BioRxiv-10.1101/2023.07.04.547735-bd2736.svg)](https://doi.org/10.1101/2023.07.04.547735)
+[![Tests](https://github.com/bede/hostile/actions/workflows/test.yml/badge.svg)](https://github.com/bede/hostile/actions/workflows/test.yml) [![PyPI version](https://img.shields.io/pypi/v/hostile)](https://pypi.org/project/hostile/) [![Bioconda version](https://anaconda.org/bioconda/hostile/badges/version.svg)](https://anaconda.org/bioconda/hostile/) [![Install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square&logo=anaconda)](https://biocontainers.pro/tools/hostile) [![Install with Docker](https://img.shields.io/badge/install%20with-docker-important.svg?style=flat-square&logo=docker)](https://biocontainers.pro/tools/hostile) [![DOI:10.1101/2023.07.04.547735](http://img.shields.io/badge/BioRxiv-10.1101/2023.07.04.547735-bd2736.svg)](https://www.biorxiv.org/content/10.1101/2023.07.04.547735)
 
 # Hostile
 
-Hostile removes host sequences from short and long reads, consuming paired or unpaired `fastq[.gz]` input. Batteries are included – Hostile downloads and saves a human T2T-CHM13v2.0 + HLA reference when run for the first time. Read headers can be replaced with integers (using `--rename`) for privacy and more compressible FASTQs. Hostile is implemented as a Python package with a CLI and Python API, but heavy lifting is all done by compiled code (Minimap2/Bowtie2 and Samtools). Bowtie2 is the default and recommended aligner for short (paired) reads while Minimap2 is default and recommended aligner for long reads. When used with a masked reference genome, Hostile achieves near-perfect retention of microbial reads while removing >99.5% of human reads. Please read the [BioRxiv preprint](https://www.biorxiv.org/content/10.1101/2023.07.04.547735) for further information and open a GitHub issue, [tweet](https://twitter.com/beconsta) or [toot](https://mstdn.science/@bede) me to report issues or suggest improvements.
-
-
+Hostile removes host sequences from short and long reads, consuming paired or unpaired `fastq[.gz]` input. Batteries are included – a human reference genome is downloaded when run for the first time. For maximum retention of microbial reads, an existing masked reference genome can be downloaded, or a new one created for target organisms. When used with a masked reference genome, Hostile achieves near-perfect retention of microbial reads while removing >99.6% of human reads. Read headers can be replaced with integers (using `--rename`) for privacy and smaller FASTQs. Heavy lifting is done with fast existing tools (Minimap2/Bowtie2 and Samtools). Bowtie2 is the default aligner for short (paired) reads while Minimap2 is default aligner for long reads. Benchmarks and further info can be found in the [BioRxiv preprint](https://www.biorxiv.org/content/10.1101/2023.07.04.547735) (please cite if useful!). Feel free open an issue, [tweet](https://twitter.com/beconsta) or [toot](https://mstdn.science/@bede) me to report problems or suggest improvements.
 
 ## Reference genomes
 
-The default `human-t2t-hla` reference is downloaded when running Hostile for the first time. This can be overriden by specifying a custom `--index`. Bowtie2 indexes need to be untarred before use. The databases `human-t2t-hla` and `human-t2t-hla-argos985-mycob140`  were compared in the [paper](https://www.biorxiv.org/content/10.1101/2023.07.04.547735).
+The default `human-t2t-hla` reference is downloaded when running Hostile for the first time. This can be overriden by specifying a custom `--index`. Bowtie2 indexes need to be untarred before use. The databases `human-t2t-hla` and `human-t2t-hla-argos985-mycob140` were compared in the [paper](https://www.biorxiv.org/content/10.1101/2023.07.04.547735).
 
-|               Name                |                         Composition                          |                      Genome (Minimap2)                       |                        Bowtie2 index                         |
-| :-------------------------------: | :----------------------------------------------------------: | :----------------------------------------------------------: | :----------------------------------------------------------: |
-|   `human-t2t-hla` **(default)**   | [T2T-CHM13v2.0](https://www.ncbi.nlm.nih.gov/assembly/11828891) + [IPD-IMGT/HLA](https://www.ebi.ac.uk/ipd/imgt/hla/) v3.51 | [human-t2t-hla.fa.gz](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla.fa.gz) | [human-t2t-hla.tar](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla.tar) |
-|     `human-t2t-hla-argos985`      | [T2T-CHM13v2.0](https://www.ncbi.nlm.nih.gov/assembly/11828891) & [IPD-IMGT/HLA](https://www.ebi.ac.uk/ipd/imgt/hla/) v3.51; masked with [985](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) [FDA-ARGOS](https://www.ncbi.nlm.nih.gov/bioproject/231221) 150mers | [human-t2t-hla-argos985.fa.gz](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla-argos985.fa.gz) | [human-t2t-hla-argos985.tar](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla-argos985.tar) |
-| `human-t2t-hla-argos985-mycob140` | [T2T-CHM13v2.0](https://www.ncbi.nlm.nih.gov/assembly/11828891) & [IPD-IMGT/HLA](https://www.ebi.ac.uk/ipd/imgt/hla/) v3.51; masked with [985](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) [FDA-ARGOS](https://www.ncbi.nlm.nih.gov/bioproject/231221) & [140](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) mycobacterial 150mers | [human-t2t-hla-argos985-mycob140.fa.gz](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla-argos985-mycob140.fa.gz) | [human-t2t-hla-argos985-mycob140.tar](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla-argos985-mycob140.tar) |
+|               Name                |                         Composition                          |                      Genome (Minimap2)                       |                        Bowtie2 index                         | Date    |
+| :-------------------------------: | :----------------------------------------------------------: | :----------------------------------------------------------: | :----------------------------------------------------------: | ------- |
+|   `human-t2t-hla` **(default)**   | [T2T-CHM13v2.0](https://www.ncbi.nlm.nih.gov/assembly/11828891) + [IPD-IMGT/HLA](https://www.ebi.ac.uk/ipd/imgt/hla/) v3.51 | [human-t2t-hla.fa.gz](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla.fa.gz) | [human-t2t-hla.tar](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla.tar) | 2023-07 |
+|     `human-t2t-hla-argos985`      | [T2T-CHM13v2.0](https://www.ncbi.nlm.nih.gov/assembly/11828891) & [IPD-IMGT/HLA](https://www.ebi.ac.uk/ipd/imgt/hla/) v3.51; masked with [985](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) [FDA-ARGOS](https://www.ncbi.nlm.nih.gov/bioproject/231221) 150mers | [human-t2t-hla-argos985.fa.gz](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla-argos985.fa.gz) | [human-t2t-hla-argos985.tar](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla-argos985.tar) | 2023-07 |
+| `human-t2t-hla-argos985-mycob140` | [T2T-CHM13v2.0](https://www.ncbi.nlm.nih.gov/assembly/11828891) & [IPD-IMGT/HLA](https://www.ebi.ac.uk/ipd/imgt/hla/) v3.51; masked with [985](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) [FDA-ARGOS](https://www.ncbi.nlm.nih.gov/bioproject/231221) & [140](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) mycobacterial 150mers | [human-t2t-hla-argos985-mycob140.fa.gz](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla-argos985-mycob140.fa.gz) | [human-t2t-hla-argos985-mycob140.tar](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla-argos985-mycob140.tar) | 2023-07 |
 
 
 
@@ -25,7 +23,7 @@ Installation with conda/mamba or Docker is recommended due to non-Python depende
 **Conda/mamba**
 
 ```bash
-conda create -n hostile -c conda-forge -c bioconda hostile  # mamba/micromamba are faster
+conda create -n hostile -c conda-forge -c bioconda hostile  # Mamba/Micromamba are faster
 conda activate hostile
 ```
 
@@ -50,7 +48,7 @@ pip install hostile  # Requires python >= 3.10
 ```bash
 git clone https://github.com/bede/hostile.git
 cd hostile
-conda env create -f environment.yml  # Mamba/micromamba are faster
+conda env create -f environment.yml  # Mamba/Micromamba are faster
 conda activate hostile
 pip install --editable '.[dev]'
 pytest

diff --git a/src/hostile/__init__.py b/src/hostile/__init__.py
@@ -1,2 +1,2 @@
 """Accurate host read removal"""
-__version__ = "0.0.3"
+__version__ = "0.1.0"
diff --git a/src/hostile/aligner.py b/src/hostile/aligner.py
@@ -1,5 +1,7 @@
 import logging
+import shutil
 import subprocess
+import tempfile
 
 from dataclasses import dataclass
 from pathlib import Path
@@ -13,43 +15,48 @@ class Aligner:
     short_name: str
     bin_path: Path
     cdn_base_url: str
-    working_dir: Path
+    data_dir: Path
     cmd: str
     paired_cmd: str
     idx_archive_fn: str = ""
     ref_archive_fn: str = ""
     idx_name: str = ""
-    idx_paths: tuple[Path] = tuple()
+    idx_paths: tuple[Path, ...] = tuple()
 
     def __post_init__(self):
         self.ref_archive_url = f"{self.cdn_base_url}/{self.ref_archive_fn}"
         self.idx_archive_url = f"{self.cdn_base_url}/{self.idx_archive_fn}"
-        self.ref_archive_path = self.working_dir / self.ref_archive_fn
-        self.idx_archive_path = self.working_dir / self.idx_archive_fn
-        self.idx_path = self.working_dir / self.idx_name
-        Path(self.working_dir).mkdir(exist_ok=True, parents=True)
+        self.ref_archive_path = self.data_dir / self.ref_archive_fn
+        self.idx_archive_path = self.data_dir / self.idx_archive_fn
+        self.idx_path = self.data_dir / self.idx_name
+        Path(self.data_dir).mkdir(exist_ok=True, parents=True)
 
     def check(self, using_custom_index: bool):
         """Test aligner and check/download a ref/index if necessary"""
         if not using_custom_index:  # Check for and if necessary fetch a genome/index
             if self.name == "Bowtie2":
                 if not all(path.exists() for path in self.idx_paths):
-                    self.working_dir.mkdir(exist_ok=True, parents=True)
-                    logging.info(f"Fetching human index")
-                    util.download(self.idx_archive_url, self.idx_archive_path)
-                    util.untar_file(self.idx_archive_path, self.working_dir)
-                    self.idx_archive_path.unlink()
+                    self.data_dir.mkdir(exist_ok=True, parents=True)
+                    logging.info(f"Fetching human index ({self.idx_archive_url})")
+                    with tempfile.NamedTemporaryFile() as temporary_file:
+                        tmp_path = Path(temporary_file.name)
+                        util.download(self.idx_archive_url, tmp_path)
+                        logging.info("Extracting index…")
+                        util.untar_file(tmp_path, self.data_dir)
                     logging.info(f"Saved human index ({self.idx_path})")
                 else:
                     logging.info(f"Found cached index ({self.idx_path})")
             elif self.name == "Minimap2":
                 if not self.ref_archive_path.exists():
-                    util.download(self.ref_archive_url, self.ref_archive_path)
+                    with tempfile.NamedTemporaryFile(delete=False) as temporary_file:
+                        tmp_path = Path(temporary_file.name)
+                        util.download(self.ref_archive_url, tmp_path)
+                        shutil.move(tmp_path, self.ref_archive_path)
                     logging.info(f"Saved human reference ({self.ref_archive_path})")
                 else:
                     logging.info(f"Found cached reference ({self.ref_archive_path})")
         try:
-            util.run(f"{self.bin_path} --version", cwd=self.working_dir)
+            util.run(f"{self.bin_path} --version", cwd=self.data_dir)
         except subprocess.CalledProcessError:
             logging.warning(f"Failed to execute {self.bin_path}")
             raise RuntimeError(f"Failed to execute {self.bin_path}")
@@ -84,16 +91,17 @@ def gen_clean_cmd(
             "{FASTQ}": str(fastq),
             "{THREADS}": str(threads),
         }
+        alignment_cmd = self.cmd
         for k in cmd_template.keys():
-            self.cmd = self.cmd.replace(k, cmd_template[k])
+            alignment_cmd = alignment_cmd.replace(k, cmd_template[k])
         rename_cmd = (
             ' | awk \'BEGIN{{FS=OFS="\\t"}} {{$1=int((NR+1)/2)" "; print $0}}\''
             if rename
             else ""
         )
         cmd = (
             # Align, stream reads to stdout in SAM format
-            f"{self.cmd}"
+            f"{alignment_cmd}"
             # Count reads in stream before filtering (2048 + 256 = 2304)
             f" | tee >(samtools view -F 2304 -c - > '{count_before_path}')"
             # Discard mapped reads
@@ -105,7 +113,6 @@ def gen_clean_cmd(
             # Stream remaining records into fastq files
             f" | samtools fastq --threads 2 -c 6 -0 '{fastq_out_path}'"
         )
-        logging.debug(f"{cmd}")
         return cmd
 
     def gen_paired_clean_cmd(
@@ -142,16 +149,17 @@ def gen_paired_clean_cmd(
             "{FASTQ2}": str(fastq2),
             "{THREADS}": str(threads),
         }
+        alignment_cmd = self.paired_cmd
         for k in cmd_template.keys():
-            self.paired_cmd = self.paired_cmd.replace(k, cmd_template[k])
+            alignment_cmd = alignment_cmd.replace(k, cmd_template[k])
         rename_cmd = (
             f' | awk \'BEGIN{{FS=OFS="\\t"}} {{$1=int((NR+1)/2)" "; print $0}}\''
             if rename
             else ""
         )
         cmd = (
             # Align, stream reads to stdout in SAM format
-            f"{self.paired_cmd}"
+            f"{alignment_cmd}"
             # Count reads in stream before filtering (2048 + 256 = 2304)
             f" | tee >(samtools view -F 2304 -c - > '{count_before_path}')"
             # Discard mapped reads and reads with mapped mates
@@ -163,5 +171,4 @@ def gen_paired_clean_cmd(
             # Stream remaining records into fastq files
             f" | samtools fastq --threads 2 -c 6 -N -1 '{fastq1_out_path}' -2 '{fastq2_out_path}'"
         )
-        logging.debug(f"{cmd}")
         return cmd
diff --git a/src/hostile/cli.py b/src/hostile/cli.py
@@ -1,4 +1,5 @@
 import json
+import logging
 from enum import Enum
 from pathlib import Path
 
@@ -41,7 +42,8 @@ def clean(
     :arg debug: show debug messages
     """
 
-    # Choose a sensible aligner
+    if debug:
+        logging.getLogger().setLevel(logging.DEBUG)
     aligner_paired = (
         lib.ALIGNER.bowtie2
         if aligner == ALIGNER.auto or aligner == ALIGNER.bowtie2
@@ -52,7 +54,6 @@ def clean(
         if aligner == ALIGNER.auto or aligner == ALIGNER.minimap2
         else lib.ALIGNER.bowtie2
     )
-
     if fastq2:
         stats = lib.clean_paired_fastqs(
             [(fastq1, fastq2)],
@@ -76,6 +77,29 @@ def clean(
     print(json.dumps(stats, indent=4))
 
 
+def mask(
+    reference: Path, target: Path, out_dir: Path = Path("masked"), threads: int = 1
+) -> None:
+    """
+    Mask reference genome against target genome[s]
+
+    :arg reference: path to reference genome in fasta[.gz] format
+    :arg target: path to target genome(s) in fasta[.gz] format
+    :arg out_dir: path to output directory
+    :arg threads: number of threads to use
+    """
+    lib.mask(reference=reference, target=target, out_dir=out_dir, threads=threads)
+
+
+def main():
+    defopt.run(
+        {"clean": clean, "mask": mask},
+        no_negated_flags=True,
+        strict_kwonly=False,
+        short={},
+    )
+
+
 # def clean_many(
 #     *fastqs: str,
 #     aligner: lib.ALIGNER = lib.ALIGNER.bowtie2,
@@ -109,26 +133,3 @@ def clean(
 #         raise NotImplementedError(
 #             "Forward and reverse fastq(.gz) paths should be separated with a comma"
 #         )
-
-
-def mask(
-    reference: Path, target: Path, out_dir: Path = Path("masked"), threads: int = 1
-) -> None:
-    """
-    Mask reference genome against target genome[s]
-
-    :arg reference: path to reference genome in fasta[.gz] format
-    :arg target: path to target genome(s) in fasta[.gz] format
-    :arg out_dir: path to output directory
-    :arg threads: number of threads to use
-    """
-    lib.mask(reference=reference, target=target, out_dir=out_dir, threads=threads)
-
-
-def main():
-    defopt.run(
-        {"clean": clean, "mask": mask},
-        no_negated_flags=True,
-        strict_kwonly=False,
-        short={},
-    )