Skip to content

Commit

Permalink
0.1.0; fixes issues 10,19,20,22,23,25
Browse files Browse the repository at this point in the history
  • Loading branch information
bede committed Jul 23, 2023
1 parent 8116618 commit 50b9f53
Show file tree
Hide file tree
Showing 28 changed files with 365 additions and 234 deletions.
22 changes: 10 additions & 12 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,20 +1,18 @@
[![Tests](https://github.com/bede/hostile/actions/workflows/test.yml/badge.svg)](https://github.com/bede/hostile/actions/workflows/test.yml) [![PyPI version](https://img.shields.io/pypi/v/hostile)](https://pypi.org/project/hostile/) [![Bioconda version](https://anaconda.org/bioconda/hostile/badges/version.svg)](https://anaconda.org/bioconda/hostile/) [![Install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square&logo=anaconda)](https://biocontainers.pro/tools/hostile) [![Install with Docker](https://img.shields.io/badge/install%20with-docker-important.svg?style=flat-square&logo=docker)](https://biocontainers.pro/tools/hostile) [![DOI:10.1101/2023.07.04.547735](http://img.shields.io/badge/BioRxiv-10.1101/2023.07.04.547735-bd2736.svg)](https://doi.org/10.1101/2023.07.04.547735)
[![Tests](https://github.com/bede/hostile/actions/workflows/test.yml/badge.svg)](https://github.com/bede/hostile/actions/workflows/test.yml) [![PyPI version](https://img.shields.io/pypi/v/hostile)](https://pypi.org/project/hostile/) [![Bioconda version](https://anaconda.org/bioconda/hostile/badges/version.svg)](https://anaconda.org/bioconda/hostile/) [![Install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square&logo=anaconda)](https://biocontainers.pro/tools/hostile) [![Install with Docker](https://img.shields.io/badge/install%20with-docker-important.svg?style=flat-square&logo=docker)](https://biocontainers.pro/tools/hostile) [![DOI:10.1101/2023.07.04.547735](http://img.shields.io/badge/BioRxiv-10.1101/2023.07.04.547735-bd2736.svg)](https://www.biorxiv.org/content/10.1101/2023.07.04.547735)

# Hostile

Hostile removes host sequences from short and long reads, consuming paired or unpaired `fastq[.gz]` input. Batteries are included – Hostile downloads and saves a human T2T-CHM13v2.0 + HLA reference when run for the first time. Read headers can be replaced with integers (using `--rename`) for privacy and more compressible FASTQs. Hostile is implemented as a Python package with a CLI and Python API, but heavy lifting is all done by compiled code (Minimap2/Bowtie2 and Samtools). Bowtie2 is the default and recommended aligner for short (paired) reads while Minimap2 is default and recommended aligner for long reads. When used with a masked reference genome, Hostile achieves near-perfect retention of microbial reads while removing >99.5% of human reads. Please read the [BioRxiv preprint](https://www.biorxiv.org/content/10.1101/2023.07.04.547735) for further information and open a GitHub issue, [tweet](https://twitter.com/beconsta) or [toot](https://mstdn.science/@bede) me to report issues or suggest improvements.


Hostile removes host sequences from short and long reads, consuming paired or unpaired `fastq[.gz]` input. Batteries are included – a human reference genome is downloaded when run for the first time. For maximum retention of microbial reads, an existing masked reference genome can be downloaded, or a new one created for target organisms. When used with a masked reference genome, Hostile achieves near-perfect retention of microbial reads while removing >99.6% of human reads. Read headers can be replaced with integers (using `--rename`) for privacy and smaller FASTQs. Heavy lifting is done with fast existing tools (Minimap2/Bowtie2 and Samtools). Bowtie2 is the default aligner for short (paired) reads while Minimap2 is default aligner for long reads. Benchmarks and further info can be found in the [BioRxiv preprint](https://www.biorxiv.org/content/10.1101/2023.07.04.547735) (please cite if useful!). Feel free open an issue, [tweet](https://twitter.com/beconsta) or [toot](https://mstdn.science/@bede) me to report problems or suggest improvements.

## Reference genomes

The default `human-t2t-hla` reference is downloaded when running Hostile for the first time. This can be overriden by specifying a custom `--index`. Bowtie2 indexes need to be untarred before use. The databases `human-t2t-hla` and `human-t2t-hla-argos985-mycob140` were compared in the [paper](https://www.biorxiv.org/content/10.1101/2023.07.04.547735).
The default `human-t2t-hla` reference is downloaded when running Hostile for the first time. This can be overriden by specifying a custom `--index`. Bowtie2 indexes need to be untarred before use. The databases `human-t2t-hla` and `human-t2t-hla-argos985-mycob140` were compared in the [paper](https://www.biorxiv.org/content/10.1101/2023.07.04.547735).

| Name | Composition | Genome (Minimap2) | Bowtie2 index |
| :-------------------------------: | :----------------------------------------------------------: | :----------------------------------------------------------: | :----------------------------------------------------------: |
| `human-t2t-hla` **(default)** | [T2T-CHM13v2.0](https://www.ncbi.nlm.nih.gov/assembly/11828891) + [IPD-IMGT/HLA](https://www.ebi.ac.uk/ipd/imgt/hla/) v3.51 | [human-t2t-hla.fa.gz](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla.fa.gz) | [human-t2t-hla.tar](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla.tar) |
| `human-t2t-hla-argos985` | [T2T-CHM13v2.0](https://www.ncbi.nlm.nih.gov/assembly/11828891) & [IPD-IMGT/HLA](https://www.ebi.ac.uk/ipd/imgt/hla/) v3.51; masked with [985](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) [FDA-ARGOS](https://www.ncbi.nlm.nih.gov/bioproject/231221) 150mers | [human-t2t-hla-argos985.fa.gz](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla-argos985.fa.gz) | [human-t2t-hla-argos985.tar](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla-argos985.tar) |
| `human-t2t-hla-argos985-mycob140` | [T2T-CHM13v2.0](https://www.ncbi.nlm.nih.gov/assembly/11828891) & [IPD-IMGT/HLA](https://www.ebi.ac.uk/ipd/imgt/hla/) v3.51; masked with [985](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) [FDA-ARGOS](https://www.ncbi.nlm.nih.gov/bioproject/231221) & [140](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) mycobacterial 150mers | [human-t2t-hla-argos985-mycob140.fa.gz](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla-argos985-mycob140.fa.gz) | [human-t2t-hla-argos985-mycob140.tar](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla-argos985-mycob140.tar) |
| Name | Composition | Genome (Minimap2) | Bowtie2 index | Date |
| :-------------------------------: | :----------------------------------------------------------: | :----------------------------------------------------------: | :----------------------------------------------------------: | ------- |
| `human-t2t-hla` **(default)** | [T2T-CHM13v2.0](https://www.ncbi.nlm.nih.gov/assembly/11828891) + [IPD-IMGT/HLA](https://www.ebi.ac.uk/ipd/imgt/hla/) v3.51 | [human-t2t-hla.fa.gz](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla.fa.gz) | [human-t2t-hla.tar](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla.tar) | 2023-07 |
| `human-t2t-hla-argos985` | [T2T-CHM13v2.0](https://www.ncbi.nlm.nih.gov/assembly/11828891) & [IPD-IMGT/HLA](https://www.ebi.ac.uk/ipd/imgt/hla/) v3.51; masked with [985](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) [FDA-ARGOS](https://www.ncbi.nlm.nih.gov/bioproject/231221) 150mers | [human-t2t-hla-argos985.fa.gz](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla-argos985.fa.gz) | [human-t2t-hla-argos985.tar](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla-argos985.tar) | 2023-07 |
| `human-t2t-hla-argos985-mycob140` | [T2T-CHM13v2.0](https://www.ncbi.nlm.nih.gov/assembly/11828891) & [IPD-IMGT/HLA](https://www.ebi.ac.uk/ipd/imgt/hla/) v3.51; masked with [985](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) [FDA-ARGOS](https://www.ncbi.nlm.nih.gov/bioproject/231221) & [140](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) mycobacterial 150mers | [human-t2t-hla-argos985-mycob140.fa.gz](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla-argos985-mycob140.fa.gz) | [human-t2t-hla-argos985-mycob140.tar](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/human-t2t-hla-argos985-mycob140.tar) | 2023-07 |



Expand All @@ -25,7 +23,7 @@ Installation with conda/mamba or Docker is recommended due to non-Python depende
**Conda/mamba**

```bash
conda create -n hostile -c conda-forge -c bioconda hostile # mamba/micromamba are faster
conda create -n hostile -c conda-forge -c bioconda hostile # Mamba/Micromamba are faster
conda activate hostile
```

Expand All @@ -50,7 +48,7 @@ pip install hostile # Requires python >= 3.10
```bash
git clone https://github.com/bede/hostile.git
cd hostile
conda env create -f environment.yml # Mamba/micromamba are faster
conda env create -f environment.yml # Mamba/Micromamba are faster
conda activate hostile
pip install --editable '.[dev]'
pytest
Expand Down
2 changes: 1 addition & 1 deletion src/hostile/__init__.py
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
"""Accurate host read removal"""
__version__ = "0.0.3"
__version__ = "0.1.0"
45 changes: 26 additions & 19 deletions src/hostile/aligner.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,7 @@
import logging
import shutil
import subprocess
import tempfile

from dataclasses import dataclass
from pathlib import Path
Expand All @@ -13,43 +15,48 @@ class Aligner:
short_name: str
bin_path: Path
cdn_base_url: str
working_dir: Path
data_dir: Path
cmd: str
paired_cmd: str
idx_archive_fn: str = ""
ref_archive_fn: str = ""
idx_name: str = ""
idx_paths: tuple[Path] = tuple()
idx_paths: tuple[Path, ...] = tuple()

def __post_init__(self):
self.ref_archive_url = f"{self.cdn_base_url}/{self.ref_archive_fn}"
self.idx_archive_url = f"{self.cdn_base_url}/{self.idx_archive_fn}"
self.ref_archive_path = self.working_dir / self.ref_archive_fn
self.idx_archive_path = self.working_dir / self.idx_archive_fn
self.idx_path = self.working_dir / self.idx_name
Path(self.working_dir).mkdir(exist_ok=True, parents=True)
self.ref_archive_path = self.data_dir / self.ref_archive_fn
self.idx_archive_path = self.data_dir / self.idx_archive_fn
self.idx_path = self.data_dir / self.idx_name
Path(self.data_dir).mkdir(exist_ok=True, parents=True)

def check(self, using_custom_index: bool):
"""Test aligner and check/download a ref/index if necessary"""
if not using_custom_index: # Check for and if necessary fetch a genome/index
if self.name == "Bowtie2":
if not all(path.exists() for path in self.idx_paths):
self.working_dir.mkdir(exist_ok=True, parents=True)
logging.info(f"Fetching human index")
util.download(self.idx_archive_url, self.idx_archive_path)
util.untar_file(self.idx_archive_path, self.working_dir)
self.idx_archive_path.unlink()
self.data_dir.mkdir(exist_ok=True, parents=True)
logging.info(f"Fetching human index ({self.idx_archive_url})")
with tempfile.NamedTemporaryFile() as temporary_file:
tmp_path = Path(temporary_file.name)
util.download(self.idx_archive_url, tmp_path)
logging.info("Extracting index…")
util.untar_file(tmp_path, self.data_dir)
logging.info(f"Saved human index ({self.idx_path})")
else:
logging.info(f"Found cached index ({self.idx_path})")
elif self.name == "Minimap2":
if not self.ref_archive_path.exists():
util.download(self.ref_archive_url, self.ref_archive_path)
with tempfile.NamedTemporaryFile(delete=False) as temporary_file:
tmp_path = Path(temporary_file.name)
util.download(self.ref_archive_url, tmp_path)
shutil.move(tmp_path, self.ref_archive_path)
logging.info(f"Saved human reference ({self.ref_archive_path})")
else:
logging.info(f"Found cached reference ({self.ref_archive_path})")
try:
util.run(f"{self.bin_path} --version", cwd=self.working_dir)
util.run(f"{self.bin_path} --version", cwd=self.data_dir)
except subprocess.CalledProcessError:
logging.warning(f"Failed to execute {self.bin_path}")
raise RuntimeError(f"Failed to execute {self.bin_path}")
Expand Down Expand Up @@ -84,16 +91,17 @@ def gen_clean_cmd(
"{FASTQ}": str(fastq),
"{THREADS}": str(threads),
}
alignment_cmd = self.cmd
for k in cmd_template.keys():
self.cmd = self.cmd.replace(k, cmd_template[k])
alignment_cmd = alignment_cmd.replace(k, cmd_template[k])
rename_cmd = (
' | awk \'BEGIN{{FS=OFS="\\t"}} {{$1=int((NR+1)/2)" "; print $0}}\''
if rename
else ""
)
cmd = (
# Align, stream reads to stdout in SAM format
f"{self.cmd}"
f"{alignment_cmd}"
# Count reads in stream before filtering (2048 + 256 = 2304)
f" | tee >(samtools view -F 2304 -c - > '{count_before_path}')"
# Discard mapped reads
Expand All @@ -105,7 +113,6 @@ def gen_clean_cmd(
# Stream remaining records into fastq files
f" | samtools fastq --threads 2 -c 6 -0 '{fastq_out_path}'"
)
logging.debug(f"{cmd}")
return cmd

def gen_paired_clean_cmd(
Expand Down Expand Up @@ -142,16 +149,17 @@ def gen_paired_clean_cmd(
"{FASTQ2}": str(fastq2),
"{THREADS}": str(threads),
}
alignment_cmd = self.paired_cmd
for k in cmd_template.keys():
self.paired_cmd = self.paired_cmd.replace(k, cmd_template[k])
alignment_cmd = alignment_cmd.replace(k, cmd_template[k])
rename_cmd = (
f' | awk \'BEGIN{{FS=OFS="\\t"}} {{$1=int((NR+1)/2)" "; print $0}}\''
if rename
else ""
)
cmd = (
# Align, stream reads to stdout in SAM format
f"{self.paired_cmd}"
f"{alignment_cmd}"
# Count reads in stream before filtering (2048 + 256 = 2304)
f" | tee >(samtools view -F 2304 -c - > '{count_before_path}')"
# Discard mapped reads and reads with mapped mates
Expand All @@ -163,5 +171,4 @@ def gen_paired_clean_cmd(
# Stream remaining records into fastq files
f" | samtools fastq --threads 2 -c 6 -N -1 '{fastq1_out_path}' -2 '{fastq2_out_path}'"
)
logging.debug(f"{cmd}")
return cmd
51 changes: 26 additions & 25 deletions src/hostile/cli.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
import json
import logging
from enum import Enum
from pathlib import Path

Expand Down Expand Up @@ -41,7 +42,8 @@ def clean(
:arg debug: show debug messages
"""

# Choose a sensible aligner
if debug:
logging.getLogger().setLevel(logging.DEBUG)
aligner_paired = (
lib.ALIGNER.bowtie2
if aligner == ALIGNER.auto or aligner == ALIGNER.bowtie2
Expand All @@ -52,7 +54,6 @@ def clean(
if aligner == ALIGNER.auto or aligner == ALIGNER.minimap2
else lib.ALIGNER.bowtie2
)

if fastq2:
stats = lib.clean_paired_fastqs(
[(fastq1, fastq2)],
Expand All @@ -76,6 +77,29 @@ def clean(
print(json.dumps(stats, indent=4))


def mask(
reference: Path, target: Path, out_dir: Path = Path("masked"), threads: int = 1
) -> None:
"""
Mask reference genome against target genome[s]
:arg reference: path to reference genome in fasta[.gz] format
:arg target: path to target genome(s) in fasta[.gz] format
:arg out_dir: path to output directory
:arg threads: number of threads to use
"""
lib.mask(reference=reference, target=target, out_dir=out_dir, threads=threads)


def main():
defopt.run(
{"clean": clean, "mask": mask},
no_negated_flags=True,
strict_kwonly=False,
short={},
)


# def clean_many(
# *fastqs: str,
# aligner: lib.ALIGNER = lib.ALIGNER.bowtie2,
Expand Down Expand Up @@ -109,26 +133,3 @@ def clean(
# raise NotImplementedError(
# "Forward and reverse fastq(.gz) paths should be separated with a comma"
# )


def mask(
reference: Path, target: Path, out_dir: Path = Path("masked"), threads: int = 1
) -> None:
"""
Mask reference genome against target genome[s]
:arg reference: path to reference genome in fasta[.gz] format
:arg target: path to target genome(s) in fasta[.gz] format
:arg out_dir: path to output directory
:arg threads: number of threads to use
"""
lib.mask(reference=reference, target=target, out_dir=out_dir, threads=threads)


def main():
defopt.run(
{"clean": clean, "mask": mask},
no_negated_flags=True,
strict_kwonly=False,
short={},
)
Loading

0 comments on commit 50b9f53

Please sign in to comment.